CN117876940B - Video language task execution and model training method, device, equipment and medium thereof - Google Patents
Video language task execution and model training method, device, equipment and medium thereof
- Publication number
- CN117876940B (application CN202410270242.6A / CN202410270242A)
- Authority
- CN
- China
- Prior art keywords
- video
- frame
- text
- features
- language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video language task execution and model training method, device, equipment and medium thereof, which are applied to the technical field of video understanding. The method comprises the steps of inputting a video sample with a text label, a video parameter to be learned and a frame parameter to be learned into a video language model, extracting visual features and parameter features with a visual language pre-training model, converting the visual features into frame visual information meeting the requirements of the visual language pre-training model with a video frame adapter based on the frame parameter to be learned, and extracting video visual information with a video adapter based on the video parameter to be learned; and iteratively updating the video language model according to the loss information between the frame visual information, the video visual information and the text semantic features until a preset model training ending condition is met. The method and the device can solve the problems of slow convergence and time and resource consumption in training of video language models in the related art, can effectively improve the training efficiency of the video language model, and save the computing resources required by model training.
Description
Technical Field
The present invention relates to the field of video understanding technologies, and in particular, to a method and apparatus for executing a video language task and training a model thereof, an electronic device, and a readable storage medium.
Background
The video language model is capable of understanding the inherent relationship of visual modalities to language modalities and may be used to perform video language related tasks including, but not limited to, video content understanding and classification tasks, video subtitle translation and generation tasks.
Video language models in the related art suffer from weak correlation between the visual modality and the text modality and from text descriptions that focus on different ranges of the video, so the models converge slowly and training is time-consuming and resource-consuming.
In view of this, improving training efficiency of video language models is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention provides a video language task execution and model training method, device, electronic equipment and readable storage medium thereof, which can effectively improve the training efficiency of a video language model and save the calculation resources required by model training.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a video language model training method, including:
Acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and frame parameters to be learned;
Inputting a video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information;
According to the frame visual information, the video visual information and the loss information of text semantic features, iteratively updating the video language model until a preset model training ending condition is met;
The method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned.
In a first exemplary embodiment, the parameter features corresponding to the frame parameters to be learned are frame parameter features, the text description tag includes a video frame description text tag, the text semantic features corresponding to the video frame description text tag are video frame text features, and the video frame adapter includes a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer and a frame output layer;
the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information.
In a second exemplary embodiment, the cross-modal fusion layer is a cross-modal attention mechanism layer, and the cross-modal fusion processing of the frame parameter coding feature and the visual feature includes:
And taking the frame parameter coding feature as a query vector, taking the visual feature as a group of value vectors and key vectors, and coding the frame parameter coding feature and the visual feature based on a cross-modal attention mechanism to be taken as a fusion result.
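As a concrete illustration of this cross-modal attention step, the following is a minimal PyTorch sketch, assuming the frame parameter coding features act as the query and the per-frame visual features act as the keys and values; the class name, feature dimension and head count are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse frame-parameter coding features (query) with visual features (key/value)."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_param_enc: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # frame_param_enc: (B, Nq, dim) query; visual_feat: (B, Nv, dim) key and value
        fused, _ = self.cross_attn(query=frame_param_enc, key=visual_feat, value=visual_feat)
        return fused  # fusion result passed on to the feature enhancement layer

# usage sketch
fusion = CrossModalFusion()
q = torch.randn(2, 32, 768)    # frame parameter coding features
kv = torch.randn(2, 197, 768)  # visual features of one frame
out = fusion(q, kv)
```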
In a third exemplary embodiment, the feature enhancement layer includes a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer;
The first feature enhancement layer is used for carrying out layer normalization processing on the fusion result and obtaining a first interaction enhancement feature through a residual connection;
the interaction feature extraction layer is used for extracting features of the first interaction enhancement feature to obtain a second interaction enhancement feature;
and the second feature enhancement layer is used for carrying out layer normalization processing on the second interaction enhancement feature together with a residual connection.
In a fourth exemplary embodiment, the training process of the video frame adapter includes:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image;
extracting text features of the video frames corresponding to the current frames to obtain text features of the image frames corresponding to the current frame images;
And carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
In a fifth exemplary embodiment, the iteratively updating the video frame adapter according to the loss information between each image frame feature and the corresponding image frame text feature includes:
Determining a frame-text matching penalty by predicting whether an image frame feature and an image frame text feature are positively matched or negatively mismatched using the video frame adapter;
Determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features;
Masking off part of the text features of the video frames, predicting the text features of the video frames which are masked off through a video frame adapter trained based on the text features of the image frames corresponding to the rest of the text features of the video frames and the image frame features, and determining text generation loss;
determining a penalty function for the video frame adapter based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
In a sixth exemplary embodiment, the determining the frame-text contrast loss by comparing the similarity between the image frame feature and the image frame text feature includes:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
a frame-text contrast loss is determined by comparing the positive similarity to the negative similarity.
In a seventh exemplary embodiment, the determining the frame-text contrast loss by comparing the similarity between the image frame feature and the image frame text feature includes:
Invoking a contrast loss function relation, and calculating the frame-text contrast loss; the contrast loss function relationship is:
Loss_ITG = -(1/N_ITG) · Σ_{i=1}^{N_ITG} log[ exp(θ(Z_i, T_i)/τ) / Σ_{j} exp(θ(Z_i, T_j)/τ) ]
Where Loss_ITG is the frame-text contrast loss, exp is the exponential function, Z_i is the i-th image frame feature, T_i is the image frame text feature matching the i-th image frame feature, T_j is the j-th image frame text feature that does not match the i-th image frame feature, N_ITG is the total number of image frame text features matching image frame features, θ is the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized.
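The relation above is an InfoNCE-style contrastive objective; below is a hedged PyTorch sketch of how it could be computed, assuming cosine similarity as θ, a scalar temperature τ, and that row i of the text features matches image frame feature i.

```python
import torch
import torch.nn.functional as F

def frame_text_contrast_loss(frame_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             tau: torch.Tensor) -> torch.Tensor:
    """Frame-text contrast loss over a batch of matched pairs.

    frame_feats: (N, d) image frame features Z_i
    text_feats:  (N, d) image frame text features, row i matches frame i
    tau:         scalar temperature parameter to be optimized
    """
    z = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sim = z @ t.t() / tau              # theta(Z_i, T_j) / tau for all pairs
    targets = torch.arange(z.size(0), device=z.device)
    # cross-entropy over rows implements -log exp(sim_ii) / sum_j exp(sim_ij)
    return F.cross_entropy(sim, targets)
```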
In an eighth exemplary embodiment, the determining the loss function of the video frame adapter based on the frame-text matching loss, the frame-text contrast loss, and the text generation loss comprises:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
and determining a loss function of the video frame adapter according to the image frame-image frame text loss and the video frame mask loss.
In a ninth exemplary embodiment, the determining the video frame mask loss includes:
Invoking a video frame mask loss function relation, and calculating the video frame mask loss; the video frame mask loss function relationship is:
Loss_MTF = E_{V∼D}[ (1/K) · Σ_{k=1}^{K} || O(V_m^k) − model(V̂, T) ||² ]
Where Loss_MTF is the video frame mask loss, E_{V∼D} denotes the expectation over the random distribution inside the mini-batch of video samples, D represents the random distribution, V represents the image frame features, V_m^k is the k-th image frame masked inside the mini-batch, O(V_m^k) is the corresponding target image frame feature, V̂ denotes the image frame features of the video sample that are not masked, T represents the image frame text features corresponding to V̂, K is the number of images masked inside the mini-batch, and model(V̂, T) represents the prediction result.
In a tenth exemplary embodiment, the parameter features corresponding to the video parameters to be learned are video parameter features, and the video adapter includes a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer, and a video output layer;
Wherein the video input layer is configured to receive the joint feature of the visual feature and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information.
In an eleventh exemplary embodiment, the feature fusion layer includes a first video feature enhancement layer, a cross-modality learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on the video parameter coding features and the video parameter features, and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
and the second video feature enhancement layer is used for carrying out residual connection on the multi-mode fusion features and carrying out layer normalization processing to obtain a fusion processing result.
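A minimal PyTorch sketch of this fusion layer, assuming the second residual connection is taken against the parameter enhancement features; class name and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class VideoFeatureFusion(nn.Module):
    """Sketch of the video adapter feature fusion layer: residual + layer norm,
    cross-modal attention (query = parameter enhancement features, key/value = joint
    features), then a second residual + layer norm."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, param_enc, video_param, joint_feat):
        # first video feature enhancement layer: residual connection + layer norm
        param_enhanced = self.norm1(param_enc + video_param)
        # cross-modal learning layer: parameter enhancement features as query
        fused, _ = self.cross_attn(query=param_enhanced, key=joint_feat, value=joint_feat)
        # second video feature enhancement layer (residual against the query, assumed)
        return self.norm2(fused + param_enhanced)
```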
In a twelfth exemplary embodiment, the video language model further comprises a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting joint features to the video adapter.
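A hedged sketch of such a docking network in PyTorch, assuming the joint layer concatenates along the token dimension and that a small Transformer encoder plays the role of the first converter model; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class DockingNetwork(nn.Module):
    """Sketch of the docking network layer: self-attention fusion of visual features,
    dimension mapping, then joining with the frame visual information."""
    def __init__(self, vis_dim: int = 1024, adapter_dim: int = 768,
                 num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=num_heads,
                                           batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=num_layers)  # first converter model
        self.proj = nn.Linear(vis_dim, adapter_dim)  # video feature extraction / dimension mapping

    def forward(self, visual_feat, frame_visual_info):
        fused = self.fuser(visual_feat)   # visual fusion features (self-attention)
        mapped = self.proj(fused)         # same dimension as the video adapter input
        # joint layer: combine frame visual information with the mapped visual features
        # (concatenation along the token dimension is an assumption)
        return torch.cat([frame_visual_info, mapped], dim=1)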
In a thirteenth exemplary embodiment, the text description tag includes a video description text tag, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video adapter includes:
extracting video features of the video visual information;
Extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
In a fourteenth exemplary embodiment, the step of determining the loss information between the video feature and the encoded text feature includes:
invoking a video-text loss calculation relation, and calculating the video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
Loss_G = -(1/N_G) · Σ_{i=1}^{N_G} log[ exp(θ(V_i, T_i)/τ) / Σ_{j} exp(θ(V_i, T_j)/τ) ]
Where Loss_G is the video-text loss, N_G is the total number of matched pairs of video features and encoded text features in the current batch, V_i is the i-th video feature in the current batch, T_i is the encoded text feature matching the i-th video feature, T_j is the j-th encoded text feature that does not match the i-th video feature, θ represents the similarity between a video feature and an encoded text feature, and τ is a parameter to be optimized.
In a fifteenth exemplary embodiment, the inputting the video sample, the preset video parameter to be learned, and the frame parameter to be learned into the video language model includes:
Performing image sampling processing on the video sample to obtain a multi-frame sample image;
Extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
And respectively extracting the video parameters to be learned and the parameter characteristics corresponding to the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
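A sketch of this extraction pipeline, assuming uniform frame sampling and simple callable interfaces for the frozen image and text encoders of the target visual language pre-training model; the function and argument names are hypothetical.

```python
import torch

def extract_inputs(video, text_label, video_params, frame_params,
                   image_encoder, text_encoder, num_frames: int = 8):
    """Sketch of step S102: sample frames, extract visual features per frame,
    and encode the text label and the learnable parameters.
    `image_encoder` / `text_encoder` stand for the frozen encoders of the
    target visual language pre-training model (assumed interfaces)."""
    # image sampling: (T, C, H, W) -> (num_frames, C, H, W), uniform stride
    idx = torch.linspace(0, video.size(0) - 1, num_frames).long()
    frames = video[idx]

    visual_feats = torch.stack([image_encoder(f.unsqueeze(0)) for f in frames], dim=1)
    text_feats = text_encoder(text_label)            # text semantic features
    video_param_feats = text_encoder(video_params)   # video parameter features
    frame_param_feats = text_encoder(frame_params)   # frame parameter features
    return visual_feats, text_feats, video_param_feats, frame_param_feats
```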
In a sixteenth exemplary embodiment, the extracting, by the text encoder using the target visual language pre-training model, the parameter features corresponding to the video parameter to be learned and the frame parameter to be learned includes:
carrying out random initialization processing on the frame parameters to be learned by using a text encoder of the target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
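A small sketch of how the shared learnable parameter sets might be declared and randomly initialized in PyTorch; the shapes are illustrative and the subsequent text-encoder encoding step is abstracted away.

```python
import torch
import torch.nn as nn

# frame parameters to be learned: one set shared by all video frames,
# random initialization taken directly as the frame parameter features
frame_params = nn.Parameter(torch.randn(1, 32, 768) * 0.02)

# video parameters to be learned: later encoded with the attention-masked
# text encoder to obtain the video parameter features
video_params = nn.Parameter(torch.randn(1, 32, 768) * 0.02)
```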
In a seventeenth exemplary embodiment, the text description tags include a video description text tag and a video frame description text tag, the text encoder using the target visual language pre-training model extracting text semantic features of the text description tag of the video sample, comprising:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
Performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video description text label based on the current attention mask so as to obtain video text characteristics.
In an eighteenth exemplary embodiment, the extracting, by the image encoder using the target visual language pre-training model, image features of each frame of sample image to obtain visual features includes:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of a second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
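The steps above follow a ViT-style patch embedding pipeline; a hedged PyTorch sketch is given below, with patch size, embedding dimension and depth chosen only for illustration (a strided convolution implements the non-overlapping split plus linear mapping).

```python
import torch
import torch.nn as nn

class PatchEmbedEncoder(nn.Module):
    """Split a frame into non-overlapping patches, linearly map each patch to a
    1-D token, add position encodings, and run a Transformer encoder."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=768,
                 num_heads=12, num_layers=12):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # stride == kernel size == patch_size implements the split + linear mapping
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # second converter model

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(frame)           # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, dim)
        x = x + self.pos_embed                # position coding information
        return self.encoder(x)                # visual features of the frame
```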
In a nineteenth exemplary embodiment, the parameter features corresponding to the frame parameters to be learned are frame parameter features, the text description tag includes a video frame description text tag and a video description text tag, the text semantic features corresponding to the video frame description text tag are video frame text features, the text semantic features corresponding to the video description text tag are video text features, and the training process of the video language model includes:
taking the video frame description text label, the frame parameters to be learned and the video sample data set as inputs, and training the video frame adapter by freezing an image encoder of the target visual language pre-training model and utilizing the frame parameters to be learned to acquire visual information corresponding to the video frame description text label;
When the training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
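A hedged sketch of this two-stage schedule in PyTorch, assuming the model exposes its image encoder and the two adapters as submodules and provides loss helpers; the optimizer choice and learning rate are illustrative.

```python
import torch

def train_two_stages(model, frame_loader, video_loader, lr=1e-4):
    """Sketch of the staged schedule: freeze the image encoder, train the video
    frame adapter first, then train the video adapter. Loss helpers are assumed."""
    for p in model.image_encoder.parameters():
        p.requires_grad = False                      # freeze the image encoder

    # stage 1: video frame adapter with frame description labels + frame parameters
    opt1 = torch.optim.AdamW(model.frame_adapter.parameters(), lr=lr)
    for batch in frame_loader:
        opt1.zero_grad()
        loss = model.frame_adapter_loss(batch)       # Loss_frame (assumed helper)
        loss.backward()
        opt1.step()

    # stage 2: video adapter with frame + video description labels and both parameter sets
    opt2 = torch.optim.AdamW(model.video_adapter.parameters(), lr=lr)
    for batch in video_loader:
        opt2.zero_grad()
        loss = model.video_language_loss(batch)      # overall Loss (assumed helper)
        loss.backward()
        opt2.step()
```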
In a twentieth exemplary embodiment, before said training said video adapter, further comprising:
When a learning rate adjustment instruction is received, updating the current learning rate according to the new learning rate of the learning rate adjustment instruction; the new learning rate is less than the current learning rate.
In a twenty-first exemplary embodiment, the training the video frame adapter comprises:
invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
Loss_frame = α_0 · Loss_ITM + α_1 · Loss_ITC + α_2 · Loss_ITG + β · Loss_MTF
Where Loss_frame represents the video frame adapter loss function, Loss_ITM is the frame-text matching loss, Loss_ITC is the text generation loss, Loss_ITG is the frame-text contrast loss, Loss_MTF is the video frame mask loss, α_0 is the frame-text matching loss coefficient, α_1 is the text generation loss coefficient, α_2 is the frame-text contrast loss coefficient, and β is the video frame mask loss coefficient.
In a twenty-second exemplary embodiment, the training the video adapter comprises:
invoking a video language loss function to train the video adapter; the video language loss function is:
Loss = α · Loss_frame + γ · Loss_G
Where Loss represents the video language loss function, Loss_frame is the video frame adapter loss function, α is the video frame adapter loss function coefficient, Loss_G is the video-text loss, and γ is the video-text loss coefficient.
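The two weighted-sum relations above can be combined as in the following sketch; the coefficient values are illustrative placeholders.

```python
def video_frame_adapter_loss(loss_itm, loss_itc, loss_itg, loss_mtf,
                             a0=1.0, a1=1.0, a2=1.0, beta=1.0):
    # Loss_frame = a0*Loss_ITM + a1*Loss_ITC + a2*Loss_ITG + beta*Loss_MTF
    return a0 * loss_itm + a1 * loss_itc + a2 * loss_itg + beta * loss_mtf

def video_language_loss(loss_frame, loss_g, alpha=1.0, gamma=1.0):
    # Loss = alpha*Loss_frame + gamma*Loss_G
    return alpha * loss_frame + gamma * loss_g
```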
The second aspect of the present invention provides a video language task execution method, including:
training to obtain a video language model by using the video language model training method according to any one of the previous claims;
Acquiring a video language task to be executed and a corresponding video language task training sample set;
Based on the video language task, utilizing the video language task training sample set to finely tune the video language model;
And executing the video language task by utilizing the trimmed video language model.
In a first exemplary embodiment, the video language task to be performed is a video content understanding task, and the video language task training sample set is a video sample set of a plurality of video samples carrying video content tags; the fine tuning of the video language model based on the video language task using the video language task training sample set includes:
And based on the video content understanding task, utilizing the video sample set to finely tune the video language model so as to execute the video content understanding task by utilizing the finely tuned video language model.
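A hedged sketch of such fine-tuning, assuming a linear classification head is attached on top of the pooled video visual information and that the model exposes a simple forward interface; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def finetune_for_content_understanding(video_language_model, loader,
                                       num_classes: int, epochs: int = 5,
                                       lr: float = 1e-5):
    """Sketch of fine-tuning the pre-trained video language model on a video
    content understanding task; the classifier head and interfaces are assumed."""
    head = nn.Linear(video_language_model.output_dim, num_classes)
    params = list(video_language_model.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for video, label in loader:
            opt.zero_grad()
            video_visual_info = video_language_model(video)   # assumed forward interface
            logits = head(video_visual_info.mean(dim=1))      # pool over tokens
            loss = F.cross_entropy(logits, label)
            loss.backward()
            opt.step()
```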
A third aspect of the present invention provides a video language model training apparatus, comprising:
the data acquisition module is used for acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and preset frame parameters to be learned;
The input data processing module is used for inputting the video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
And the model parameter updating module is used for carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of the text semantic features until the preset model training ending condition is met.
A fourth aspect of the present invention provides a video language task execution device, including:
the model training module is used for training to obtain a video language model by utilizing the video language model training method according to any one of the previous claims;
the data acquisition module is used for acquiring a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module is used for carrying out fine tuning on the video language model by utilizing the video language task sample set based on the video language task;
and the task execution module is used for executing the video language task by utilizing the trimmed video language model.
The fifth aspect of the present invention also provides an electronic device comprising a processor for implementing the steps of the video language model training method according to any one of the preceding claims when executing a computer program stored in a memory.
The invention finally provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video language model training method of any one of the preceding claims.
The technical scheme provided by the invention has the advantages that the video frame adapter is utilized to convert the visual characteristics of the video sample into the frame visual information meeting the requirements of the visual language pre-training model, the frame parameter to be learned is utilized to learn the visual information related to the text, the model is assisted to build association between different frames, the model is guided to pay attention to different visual information, the problem of weak correlation between the video mode and the language mode can be solved, and the adaptation of the visual language pre-training model to the video frame is realized. The method has the advantages that the video adapter can integrate frame text visual information, the problem of a language mode focusing range is solved, the to-be-learned video parameter auxiliary model is utilized to establish a semantic corresponding relation on the whole video sequence and understand global information of video, the global video information and the frame text visual information are integrated, and the information loss caused by the attention deviation of the frame text visual information is solved, so that visual information of the video at different layers such as local details and global semantics is fully utilized, the visual understanding capability of the model to the video is improved, the representation capability of the video language model is improved, the video language model is enabled to be converged rapidly in the training process, the training efficiency of the video language model is improved, and calculation resources required by training are saved. Furthermore, on the basis of the existing visual language pre-training model, the video language model is built by adding the video frame adapter and the video adapter, the original visual language pre-training model structure is not required to be changed, all network structures are not required to be redesigned, a large amount of video text data is not required to retrain the video language model, the rich visual representation of the visual language pre-training model and the strong cross-modal interaction capability can be migrated to a video language task, the original model performance is reserved, and the expansibility and flexibility of the visual language pre-training model are enhanced.
In addition, the invention also provides a corresponding video language task execution method, an implementation device, electronic equipment and a readable storage medium aiming at the video language model training method, so that the method has more practicability, and the video language task execution method, the video language task execution device, the electronic equipment and the readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
For a clearer description of the present invention or of the technical solutions related thereto, the following brief description will be given of the drawings used in the description of the embodiments or of the related art, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained from these drawings without the inventive effort of a person skilled in the art.
FIG. 1 is a schematic flow chart of a video language model training method provided by the invention;
FIG. 2 is a schematic diagram of a video frame adapter in an exemplary application scenario according to the present invention;
FIG. 3 is a schematic diagram of a training process for a video frame adapter in an exemplary application scenario provided by the present invention;
FIG. 4 is a schematic diagram of a video adapter in an exemplary application scenario according to the present invention;
FIG. 5 is a schematic diagram of a video adapter and docking network in an exemplary application scenario according to the present invention;
FIG. 6 is a schematic diagram of a training process for a video adapter in an exemplary application scenario provided by the present invention;
FIG. 7 is a schematic flow chart of a video language task execution method provided by the invention;
FIG. 8 is a schematic diagram of a hardware framework of an exemplary application scenario of the video language task execution method provided by the present invention;
FIG. 9 is a schematic diagram of a video language model in an exemplary application scenario according to the present invention;
FIG. 10 is a block diagram of an embodiment of a video language model training apparatus provided by the present invention;
FIG. 11 is a block diagram of an embodiment of a video language task execution device provided by the present invention;
fig. 12 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and the detailed description. Wherein the terms "first," "second," "third," "fourth," and the like in the description and in the above figures are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The video language model is a cross-modal model capable of deeply understanding the internal relation between a visual mode and a language mode, and is widely applied to various application scenes related to video language, for example, the video language model can help a user to quickly locate and understand video contents in the application scenes of searching and annotating video. The ability to accurately analyze user interests and preferences via a video language model can also drive the development of video summary generation and personalized recommendation systems. In addition, the video language model also understands the video contents such as actual scenes, objects, actions, emotion and the like, and is beneficial to the development of the related fields of natural language description, such as the generation of titles, subtitles and storylines. For application scenes of video questions and answers and conversations, the video language model enables a computer to understand and accurately respond to video related questions.
However, the current video language model has the problem of language mode focusing range, that is, the range of the video focused by the video language model is different, for example, the text description may focus on some frames, some sections of video and the whole video, which results in slow convergence of training of the video language model, and the execution result of the final video language task does not meet the requirements of users. In addition, the video comment text description of the video language model is short, the sparsity of the video text description causes the association between the modeling video features and the text description to be more difficult, the training convergence of the video language model is slow, and the performance of the video language model is poor. In the current video language model training process, inconsistency exists between visual information in a sample video and semantic information in description. A scene in a video may not establish a clear correspondence with a specific semantic concept in a description, resulting in reduced correlation between the description and the video, slow convergence of training of a video language model, and failure of the video language model performance to meet the needs of a user. Further, video language tasks often require training complex deep learning models, which have a large number of parameters and hierarchical structures, resulting in an increase in the demands on computing and memory resources required for the model training process, and within a certain range, the structure of the model and the scale of training samples are also positively correlated to the model performance, so that in order to obtain a video language model with good performance, multiple iterations and experiments are required for training and tuning the model, further increasing the training time and the consumption of computing resources.
Therefore, video language models in the related art suffer from weak correlation between video and language and from text descriptions that focus on different ranges of the video, so the models converge slowly and training is time-consuming and resource-consuming. Furthermore, video language pre-training models in the related art require redesigning all network structures and then re-training the model on a large amount of video-text data: video features and text features are extracted first, and the visual information and text information are then mapped to a unified semantic space through contrastive learning or an attention-based Transformer (converter) model, which is time-consuming and labor-intensive, generally requires a large amount of computing resources, and offers poor flexibility and extensibility.
In view of this, in order to solve the problem that the video language pre-training in the related technology consumes resources, the video is weakly related to the language, and the language mode focuses on the video mode range, the invention can complete the video language pre-training model adaptation by adding a video frame adapter, a video adapter, all video frames sharing a frame parameter to be learned and a small amount of parameters of the video parameter to be learned on the basis of the video language pre-training model in the related technology without changing the original model parameters, thereby realizing the adaptation of the abundant visual representation of the video language pre-training model and the migration of strong cross-mode interaction capability to the video language task, namely retaining the original model performance, enhancing the expansibility and flexibility of the video language model, effectively improving the training efficiency of the video language model, and saving the calculation resources required by the model training. Having described aspects of the invention, various non-limiting embodiments of the invention are described in detail below. Numerous specific details are set forth in the following description in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods and means have not been described in detail in order to not obscure the present invention.
Referring to fig. 1, fig. 1 is a flow chart of a video language model training method provided in this embodiment, where the embodiment may include the following:
S101: and acquiring a video sample data set carrying a text description tag, a preset video parameter to be learned and a preset frame parameter to be learned.
In this embodiment, the video sample data set is a training sample data set used for training a video language model, which may include a large number of video samples covering a plurality of rich scenes, and the number of video samples may be flexibly determined according to actual requirements, which does not affect the implementation of the present invention. Each video sample is provided with a text description tag, the text description tag is used for describing the video sample, the text description tag is text data, the video samples in the video sample data set are all labeled with text description in advance, and labeled information is the text description tag.
The frame parameters to be learned are one variable or a group of variables of visual information which enable the video frames to learn text related in the training process of the video language model, the variable or the group of variables can be flexibly determined according to specific practical application scenes, the frame parameters to be learned are learnable parameters of the video frames, all the video frames share one frame parameter to be learned, and the shared frame parameters to be learned are helpful for the video language model to be disassociated and understand the correlation among different video frames. In video, different frames may contain the same person, the same scene, or objects with relevance. By sharing the frame parameters to be learned, the ability of the model to establish associations between different frames can be assisted. For example, other frames are assisted in finding related persons or objects, thereby improving the accuracy and consistency of the video language model in multi-frame video understanding and analysis. In addition, as the corresponding visual information of the text can be obtained through the frame parameters to be learned, different visual information can be obtained by different frames, which is equivalent to providing different visual angles and information, the video language model can be guided to pay attention to the information through sharing the frame parameters to be learned among different frames, and meanwhile, the complementary information can be comprehensively utilized, so that the understanding and representing capability of the video language model to video language tasks can be improved. The video parameters to be learned are also one variable or a group of variables which enable the video to learn and acquire visual information related to texts in the video language model training process, the variable or the group of variables can be flexibly determined according to specific practical application scenes, the video parameters to be learned are learnable parameters of the video, and the parameters can assist the video language model to establish semantic corresponding relations on the whole video sequence and understand global information of the video. The video parameters to be learned can enable the video language model to establish semantic corresponding relation on the whole video sequence and understand global information of the video, and the frame parameters to be learned can enable the video language model to be locally aligned and understood on the video frame level. By combining the video parameters to be learned and the frame parameters to be learned, fine granularity alignment across time scales is realized, visual information of videos at different levels such as local details and global semantics is fully utilized, understanding and expression capacity of global and local features of the video language model to be processed or video samples is improved, visual understanding capacity of the video language model to be processed or video samples is improved, and representation capacity of the video language model is improved.
S102: and inputting the video sample, the video parameter to be learned and the frame parameter to be learned in the video sample data set into the video language model.
In this step, the video language model is a pre-training model frame pre-built based on the target visual language pre-training model, the video frame adapter and the video adapter, that is, the model structure of the video language model includes the target visual language pre-training model, the video frame adapter and the video adapter. The target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the target visual language pre-training model can simultaneously understand and fuse visual and language information, cross-modal visual and language pre-training models are realized, the target visual language pre-training model can learn the association between the visual information and the language information by pre-training data on a large-scale image-text pair, and therefore the model can simultaneously process and understand the image and text information and has good performance and generalization capability in a plurality of visual and language tasks.
The target visual language pre-training model of the present embodiment may employ any of the VLP (Vision-Language Pre-training) models of the related art, including, but not limited to, the ViLBERT (Vision-and-Language BERT, where BERT stands for Bidirectional Encoder Representations from Transformers) model, the LXMERT (Language-Visual Multimodal BERT, a cross-modality BERT framework for visual and language understanding) model, the VideoBERT model, the VisualBERT model, the UniVL (Unified Video and Language Pre-Training) model, the UNITER (Universal Image-Text Representation) model, the CLIP (Contrastive Language-Image Pre-training) model, the OSCAR (Object-Semantics Aligned Pre-training) model, and the MOCA model. The VideoBERT model is a joint model that can be used for video and language representation learning; through the pre-training process, it can extract and understand information from video and the text associated with it, allowing the model to understand visual concepts in the video and how those concepts relate to natural language descriptions. The UniVL model is a unified video and language pre-training model for multimodal understanding and generation that obtains a unified representation of video and language by jointly optimizing understanding and generation tasks; this approach enables the model to understand and generate language descriptions related to the video. The VisualBERT model is a simple and efficient benchmark model for vision and language learning that combines visual and language information and learns multimodal representations through pre-training and fine-tuning tasks. The MOCA model carries out video and text representation learning through a memory-enhanced contrastive learning method: it builds positive and negative pairs between video and text, performs self-supervised training, and uses a memory mechanism to enhance the contrastive learning effect. The ViLBERT model realizes cross-modal understanding and reasoning over images and texts by jointly training an image encoder and a text encoder to learn the interrelationship between images and texts. The UNITER model is pre-trained on large-scale image-text pair data, learns visual and linguistic representations, and has multimodal understanding and generation capabilities. The CLIP model is a visual language pre-training model pre-trained through contrastive learning; it is trained by learning the similarity of images and texts, so it has strong image-text matching capability and can judge the relevance between images and texts. The OSCAR model is a VLP model targeting object and semantic alignment that learns visual and linguistic representations by jointly training visual and linguistic encoders and enables alignment and correlation between images and text.
In this step, after a corresponding number of video samples are selected from the video sample dataset according to preset training parameters and input to the video language model, the image encoder of the target visual language pre-training model is used to extract visual features of the video samples, and the text encoder of the target visual language pre-training model is used to extract text semantic features of text description tags of the video samples and extract parameters to be learned and parameter features corresponding to the parameters to be learned. The image encoder of the target visual language pre-training model transmits the extracted visual features to the video frame adapter and the video adapter respectively, the parameter features corresponding to the video parameters to be learned are input to the video adapter, and the text encoder of the target visual language pre-training model inputs the parameter features corresponding to the frame parameters to be learned to the video frame adapter.
In this embodiment, since there is often weak correlation between the image and the text of the target visual language pre-training model, and the weak correlation is more serious in the video language data, although the image features of rich semantic information can be extracted by using the target visual language pre-training model, the image features have strong semantic consistency and generalization performance, but still cannot meet the requirement of adapting video frames, so the image features need to be converted into frame features meeting the requirement of the video language pre-training model through a video frame adapter. In other words, the video frame adapter converts the visual features into frame visual information meeting the requirements of the target visual language pre-training model by forcing the frame parameters to be learned to acquire text-related visual information according to the received visual features. It can be understood that the frame text visual information extracted by the video frame adapter is fine-grained information at the frame level, and lacks of understanding of the whole video, on the basis of which the video adapter combines the visual features extracted by the image encoder of the target visual language pre-training model, the visual information about the video text related to the video parameters to be learned is learned based on the video adapter. In other words, the video adapter receives visual features, learns text information of the learning video through the video parameters to be learned, assists the model to build the whole video sequence, builds semantic corresponding relation and understands global information of the video, and extracts video visual information.
S103: and carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of the text semantic features until the preset model training ending condition is met.
After the frame visual information and the video visual information corresponding to the video sample are obtained in the previous step, they are compared with the text semantic features of the text description label corresponding to the video sample, and the model parameters of the video language model are updated by continuously reducing the difference between them. The video language model may be trained by mini-batch stochastic gradient descent until a preset model training ending condition is reached; for example, the condition may be that the number of iterations reaches a preset value, that the video language model converges, or that the precision of the video language model reaches a preset precision threshold, none of which affects the implementation of the method. Before the gradient update iterations, the gradient descent algorithm needs to be initialized by setting the epoch (training period), the batch_size (batch size), the weight update period t, and the iteration number. For example, the video sample data set may include 60,000 video samples and the video language model may be trained for at least 100 training periods; one training period (epoch) means that all training samples in the training set are used once, without repetition, to update the model parameters of the neural network, with one batch of data taken at a time to update the model parameters of the video language model. In the gradient update iteration process, 500 video samples are used per iteration update; these 500 video samples are referred to as one batch of data, i.e., batch_size samples. The iteration number refers to the number of training steps using batch_size samples, so completing one epoch takes iteration = 60000/500 = 120 iterations. The weight update period means that the weights are updated once every t iterations during training. When the preset model training ending condition is reached, the video language model is trained: it can fully utilize visual information of the video at different levels and, through joint learning of video and language representations, effectively improve the understanding and reasoning capability for video content, so that the capability of the visual language pre-training model is transferred to the video language pre-training task, the performance of the original model is retained, and its expansibility and flexibility are enhanced, making it suitable for a wider range of multi-modal application fields.
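A sketch of the mini-batch stochastic gradient descent schedule described above, assuming a loss helper on the model and treating the weight update period t as gradient accumulation; the optimizer and learning rate are illustrative.

```python
import torch
from torch.utils.data import DataLoader

def train_video_language_model(model, dataset, epochs=100, batch_size=500,
                               update_period_t=1, lr=1e-4):
    """Sketch of the training schedule: 60,000 samples with batch_size = 500 gives
    120 iterations per epoch; weights are stepped every `update_period_t` iterations."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for it, batch in enumerate(loader):
            loss = model.training_loss(batch)        # assumed helper returning the loss
            loss.backward()
            if (it + 1) % update_period_t == 0:
                opt.step()
                opt.zero_grad()
```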
In the technical scheme provided by this embodiment, the video frame adapter is used to convert the visual features of the video sample into frame visual information that meets the requirements of the visual language pre-training model, and the frame parameters to be learned are used to learn text-related visual information, assist the model in establishing associations between different frames and guide the model to pay attention to different visual information; this alleviates the weak correlation between the video modality and the language modality and realizes the adaptation of the visual language pre-training model to video frames. The video adapter can integrate the frame text visual information, alleviating the problem of the language modality's limited attention range; the video parameters to be learned assist the model in establishing semantic correspondences over the whole video sequence and in understanding the global information of the video, and the global video information is integrated with the frame text visual information, compensating for the information loss caused by the attention deviation of the frame text visual information. In this way the visual information of the video at different levels, such as local details and global semantics, is fully utilized, the model's visual understanding of the video is improved, the representation capability of the video language model is improved, the video language model converges quickly during training, the training efficiency of the video language model is improved, and the computing resources required for training are saved. Furthermore, on the basis of an existing visual language pre-training model, the video language model is built by adding the video frame adapter and the video adapter: the original visual language pre-training model structure does not need to be changed, the whole network structure does not need to be redesigned, and a large amount of video text data is not required to retrain the video language model, so the rich visual representation of the visual language pre-training model and its strong cross-modal interaction capability can be migrated to the video language task, the original model performance is retained, and the expansibility and flexibility of the visual language pre-training model are enhanced.
In the above embodiment, the structure of the video frame adapter is not limited, and based on the above embodiment, as shown in fig. 2, the present invention also provides an exemplary structure of the video frame adapter, which may include the following:
in this embodiment, the video frame adapter may include a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer, and a frame output layer. The frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is internally provided with a text coder which is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics and frame text coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result, inputting the enhancement feature into the text coding layer, repeating the process for a plurality of times until the first preset repetition times are reached, for example, M1 times, M1 can be 200, and obtaining frame visual information; the frame output layer is used for outputting the frame visual information.
For convenience of representation, the parameter features corresponding to the frame parameters to be learned may be defined as frame parameter features. The text description label includes a video frame description text label and a video description text label: the video description text label is the text data corresponding to the whole video, and the video frame description text label is the text data corresponding to the current frame image. For example, the video description text label may be "an indicator light on a server panel is flashing", and the video frame description text label may be "an indicator light on the server panel is lit and emits green light". For ease of description, the text semantic features corresponding to the video frame description text label may be defined as video frame text features. The frame parameter features are the features output after the text encoder of the target visual language pre-training model encodes the frame parameters to be learned, and the video frame text features are the features output after the text encoder of the target visual language pre-training model encodes the video frame description text label; the frame parameters to be learned and the video frame description text label can be encoded in the manner corresponding to the model type adopted by the target visual language pre-training model. The frame input layer may preset an attention mask, and the value of the attention mask indicates which part of the input is encoded. For example, the attention mask may be expressed as mask = [a_1, ..., a_c, b_1, ..., b_m], where a_i is the mask of the i-th frame parameter to be learned and b_i is the mask of the i-th word; giving the attention mask different types of mask values indicates that different types of data are encoded. For example, when only the frame parameters to be learned are encoded, the video frame description text label is masked off: a_i may be set to 1, and b_i is set to 0 if a video frame description text label is input, or left unset if no video frame description text label is input. When only the video frame description text label is encoded, the frame parameters to be learned are masked off: b_i is set to 1, and a_i is set to 0 if frame parameters to be learned are input, or left unset if no frame parameters to be learned are input. If the frame parameters to be learned and the video frame description text label are encoded at the same time, all values in the attention mask may be set to 1, indicating simultaneous encoding of the two modalities. The frame text encoding features are the features obtained after the text encoder of the video frame adapter encodes the video frame text features, and the frame parameter encoding features are the features obtained after the text encoder of the video frame adapter encodes the frame parameter features. The text encoder built into the text encoding layer may be of the same type as or of a different type from the text encoder of the target visual language pre-training model, which does not affect the implementation of the present invention.
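The three mask configurations described above may be sketched as follows, assuming c frame parameters to be learned and m words in the video frame description text label; the helper name build_attention_mask is illustrative.

```python
import torch

def build_attention_mask(c, m, mode):
    """mode: 'params_only', 'text_only' or 'both' (the third way described above)."""
    a = torch.ones(c) if mode in ('params_only', 'both') else torch.zeros(c)  # a_i: frame parameters
    b = torch.ones(m) if mode in ('text_only', 'both') else torch.zeros(m)    # b_i: words of the label
    return torch.cat([a, b])

mask = build_attention_mask(c=16, m=32, mode='both')   # encode both modalities simultaneously
```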
The cross-modal fusion layer may adopt any mechanism capable of realizing cross-modal data fusion, forcing the frame parameters to be learned to extract text-related visual features while serving as a bridge that promotes information interaction and integration between the two modalities; internally it may adopt any method capable of extracting the inherent relation between data of different modalities and fusing them, such as any attention mechanism or cross-modal attention mechanism. Taking a cross-modal attention mechanism as an example, the cross-modal fusion layer may be a cross-modal attention mechanism layer: the frame parameter encoding features are used as the query vectors, the visual features are used as a group of value vectors and key vectors, the frame parameter encoding features and the visual features are encoded based on the cross-modal attention mechanism, and the encoding result is used as the fusion result. In order to further improve the precision of the extracted fusion features and accelerate the convergence speed of the video frame adapter, convergence of the model can be accelerated through layer normalization, while the generalization capability is further improved through a residual structure; correspondingly, the feature enhancement layer may include a first feature enhancement layer, an interaction feature extraction layer and a second feature enhancement layer. The first feature enhancement layer is used for performing layer normalization on the fusion result and obtaining a first interaction enhancement feature through residual connection; the interaction feature extraction layer is used for extracting features from the first interaction enhancement feature to obtain a second interaction enhancement feature; the second feature enhancement layer is used for performing layer normalization on the second interaction enhancement feature with residual connection. The interaction feature extraction layer may be any model structure capable of extracting deep features, for example a feedforward neural network, a fully-connected neural network or a video feature extraction layer; the feedforward neural network first maps the data to a high-dimensional space and then, through a linear transformation, to a low-dimensional space, thereby extracting deeper features. For example, let the visual feature corresponding to the current image frame be v and the frame parameter encoding feature be fp; the result of encoding the frame parameter encoding feature and the visual feature based on the cross-modal attention mechanism may be expressed as feat = CrossAttention(fp, v, v). Layer normalization of the first feature enhancement layer and residual connection yield the first interaction enhancement feature feat1 = LayerNorm(feat + α0 · fp), where α0 is a residual coefficient. The interaction feature extraction layer is a feedforward neural network (namely Feed forward), and a Feed forward operation is performed on the first interaction enhancement feature to obtain the second interaction enhancement feature ff = FeedForward(feat1). Layer normalization and residual connection are applied again to obtain feat2 = LayerNorm(ff + α1 · feat1), where α1 is a residual coefficient. The obtained feat2 is input into the text encoding layer and the above steps are executed in sequence using the first attention mask configuration; after repeating for the first preset number of repetitions, the final frame feature F_frame (i.e. the frame visual information) is obtained, whose dimensions are the same as those of the input fp.
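For illustration, the cross-modal fusion and feature enhancement steps described above might be sketched in PyTorch roughly as follows; the class name CrossModalFusionBlock, the GELU activation, the 4x hidden width of the feed-forward sub-layer and the default residual coefficients are assumptions of this sketch rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Illustrative sketch of cross-modal fusion + feature enhancement."""
    def __init__(self, dim, heads=6, alpha0=1.0, alpha1=1.0):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.alpha0, self.alpha1 = alpha0, alpha1

    def forward(self, fp, v):
        # fp: frame parameter encoding features (query), v: visual features (key/value)
        feat, _ = self.cross_attn(fp, v, v)              # cross-modal fusion layer
        feat1 = self.norm1(feat + self.alpha0 * fp)      # first feature enhancement layer
        ff = self.ffn(feat1)                             # interaction feature extraction layer
        return self.norm2(ff + self.alpha1 * feat1)      # second feature enhancement layer
```

In the complete video frame adapter, the output feat2 of such a block is fed back through the text encoding layer and the block is applied repeatedly until the first preset number of repetitions (for example M1 = 200) is reached, yielding the frame visual information.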
To make the encoding process clearer to those skilled in the art, this embodiment takes the target visual language pre-training model being CLIP as an example, the corresponding text encoder being CLIP text Encoder (i.e., the CLIP text encoder). For example, the text encoder of the target visual language pre-training model can be used to perform random initialization of the frame parameters to be learned, and the random initialization result of the frame parameters to be learned is used as the frame parameter feature; the video frame description text label is tokenized with the text encoder of the target visual language pre-training model, and word embedding is applied to the tokenization result to obtain the video frame text features. For example, randomly initializing the frame parameters to be learned (which may be defined as the frame learnable context), the result may be recorded as flc ∈ R^{C×D}, where R is the set of real numbers, C is the embedding length and D is the embedding dimension, D being equal to the dimension of the tokenized video frame text features. When the frame parameters to be learned are initialized, the mean may be 0 and the variance may be 1. The result of tokenizing the video frame description text label may be recorded as token(text), where text is the video frame description text label and token is the tokenizer of any BERT model; word embedding (word_embedding) can then be applied to the tokenized result to obtain the video frame text features, and the word embedding result may be expressed as emb = word_embedding(token(text)). word_embedding is the embedding corresponding to the video frame adapter; taking the target visual language pre-training model being CLIP as an example, with the corresponding text encoder CLIP text Encoder (i.e., the CLIP text encoder), the output corresponding to this result is emb = [w_1, ..., w_m]. The random initialization result flc of the frame parameters to be learned and the video frame text features emb are spliced, and the splicing result is taken as the input of the frame input layer of the video frame adapter: input = [f_1, ..., f_c, w_1, ..., w_m], where f_i is the frame parameter feature corresponding to the i-th frame parameter to be learned and w_i is the embedding of the i-th word, i.e., the video frame text feature. The random initialization result flc and the video frame text features are encoded with the text encoder built into the text encoding layer of the video frame adapter, and the encoding result may be recorded as enc = [p_1, ..., p_c, t_1, ..., t_m], where the first c entries are the frame parameter encoding features corresponding to the frame parameter features, denoted fp = [p_1, ..., p_c], and the latter m entries are the encodings of the video frame text features, i.e., the frame text encoding features, denoted ft = [t_1, ..., t_m]. This embodiment needs to encode both the frame parameters to be learned and the video frame description text label, so the attention mask here adopts the third way described above.
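Purely for illustration, the random initialization, tokenization, word embedding and splicing steps may be sketched as follows; tokenize and the vocabulary size are placeholders standing in for the actual CLIP/BERT tokenizer rather than a concrete API.

```python
import torch
import torch.nn as nn

C, D, m = 16, 512, 32                      # assumed embedding length, dimension and word count

flc = torch.randn(C, D)                    # frame learnable context, mean 0 / variance 1 init

# Placeholders standing in for the BERT-style tokenizer and the adapter's word embedding;
# a real implementation would use the CLIP/BERT tokenizer and its vocabulary.
def tokenize(text, length=m):
    return torch.arange(length)

word_embedding = nn.Embedding(49408, D)    # assumed vocabulary size

token_ids = tokenize("an indicator light on the server panel is lit")   # token(text)
emb = word_embedding(token_ids)                                          # (m, D) video frame text features

adapter_input = torch.cat([flc, emb], dim=0)    # splice: [f_1 .. f_C, w_1 .. w_m]
```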
As can be seen from the above, in this embodiment the video frame adapter migrates and adapts the visual information learned by the target visual language pre-training model, thereby completing the frame feature extraction of the video language pre-training model and alleviating the weak correlation between video and language at the frame level. Setting layer normalization processing can accelerate the convergence speed of the video frame adapter, while the residual structure further improves the generalization capability, which is beneficial to improving the training efficiency and performance of the whole video language model.
Based on the model structure of the video frame adapter determined in any of the above embodiments, the present embodiment further needs to train the video frame adapter, and the training process of the video frame adapter may include the following:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image; extracting the characteristics of the text characteristics of the video frame corresponding to the current frame to obtain the text characteristics of the image frame corresponding to the current frame image; and carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
In this embodiment, as shown in fig. 3, the network structure interfacing with the video frame adapter may be used to extract the video frame text features and the features of the frame visual information. A fully connected layer can be used to obtain, from the frame visual information corresponding to the current frame, what may be defined as the image frame feature Z, i.e. Z = FC(frame visual information); the input dimension of the fully connected layer is the dimension of the frame visual information and its output dimension is the same as that of the image frame text feature. A feedforward neural network can be used to extract the features of the video frame text features corresponding to the current frame, which may be defined as the image frame text feature T, i.e. T = FFN(video frame text features). Because the input layer of the video frame adapter is the splicing result of the frame parameter features and the video frame text features, the image frame text feature corresponding to the current frame image can also be extracted directly from the frame text encoding features corresponding to the current frame, i.e. T = FFN(frame text encoding features); the input dimension here is the dimension of the frame text encoding features and the output dimension is the same as that of the image frame feature.
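The two heads described above might be sketched as follows; all dimensions are assumed for illustration.

```python
import torch.nn as nn

frame_dim, text_dim, proj_dim = 768, 512, 256    # assumed dimensions

# Fully connected head producing the image frame feature Z from the frame visual information
frame_head = nn.Linear(frame_dim, proj_dim)

# Feedforward head producing the image frame text feature T from the frame text encoding features
text_head = nn.Sequential(nn.Linear(text_dim, 4 * text_dim), nn.GELU(),
                          nn.Linear(4 * text_dim, proj_dim))
```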
The above embodiments do not limit how to iteratively update the video frame adapter according to the loss information between each image frame feature and the corresponding image frame text feature, i.e. do not limit any loss function adopted by the video frame adapter. Of course, as a simple implementation, the loss function of the video frame adapter may be based directly on a mean square error or cross entropy error or other related art loss function. In order to improve the performance of the video frame adapter, this embodiment also provides a determination manner of the loss function of the video frame adapter, which may include the following:
Determining a frame-text matching loss by predicting whether the image frame features and the image frame text features are positively matched or negatively unmatched using the video frame adapter; determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features; masking off part of the text features of the video frames, predicting the text features of the video frames which are masked off through a video frame adapter trained based on the text features of the image frames corresponding to the rest of the text features of the video frames and the image frame features, and determining text generation loss; a penalty function for the video frame adapter is determined based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
Here, the frame-text matching loss is the loss of whether a video frame matches the text; its aim is to learn fine-grained alignment between video frames and text tokens as a binary classification task, requiring the video frame adapter to predict whether an image-text pair (V, T) is positively matching or negatively non-matching, where V and T are the video frame and the text tokens respectively, "matched" denotes a matching video frame and text token pair, i.e. a positive match, and "unmatched" denotes a non-matching pair, i.e. a negative non-match. The frame-text contrast loss contrasts video frames with text, with the aim of achieving their alignment by maximizing the mutual information between video frame and text tokens. For example, the process of calculating the frame-text contrast loss may include: taking positively matching image frame features and image frame text features as a group of positive samples, and taking negatively non-matching image frame features and image frame text features as a group of negative samples; calculating the positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating the negative similarity between the image frame features and the image frame text features in each group of negative samples; and determining the frame-text contrast loss by comparing the positive and negative similarities. In other words, the loss is obtained by comparing the video frame-text similarity of the positive sample pairs with the video frame-text similarity of the negative samples. This embodiment can acquire text-related visual information through the frame parameters to be learned and then calculate the similarity with the text representation. The text generation loss is the loss of generating text from video frames: part of the text is masked off, the masked part is predicted from the image frames and the remaining text, and the final loss is obtained through cross entropy. To further enhance the joint understanding of image frames and text, a video frame mask loss may also be added to the calculation of the loss function of the video frame adapter on the basis of the above embodiments. For example, a certain frame feature in the video sample can first be masked, the masked video frame feature can be predicted from the other frames and the frame texts, and the difference between the ground truth and the prediction can be compared to obtain the corresponding video frame mask loss. On this basis, the calculation process of the loss function of the video frame adapter may include: determining an image frame-image frame text loss from the frame-text matching loss, the frame-text contrast loss and the text generation loss; masking a target image frame of the video sample, predicting the target image frame through the video frame adapter trained on the image frame text features and image frame features corresponding to the masked video sample, and determining the video frame mask loss; and determining the loss function of the video frame adapter based on the image frame-image frame text loss and the video frame mask loss.
In order to improve the model training efficiency, a contrast loss function relation and a video frame mask loss function relation can be stored in advance, frame-text contrast loss is calculated by calling the contrast loss function relation, and video frame mask loss function relation is called to calculate video frame mask loss. The contrast loss function relationship can be expressed as:
Loss_ITG = -(1/N_ITG) · Σ_i log[ exp(θ(Z_i, T_i)/τ) / Σ_j exp(θ(Z_i, T_j)/τ) ];
The video frame mask penalty function relationship may be expressed as:
Loss_MTF = E_{V~D} [ (1/K) · Σ_{k=1}^{K} || O(V_m^k) - model(V̂, T)_k ||^2 ];
Where Loss_ITG is the frame-text contrast loss, exp is the exponential function, Z_i is the i-th image frame feature, T_i is the image frame text feature matching the i-th image frame feature, T_j is the j-th image frame text feature that does not match the image frame feature, N_ITG is the total number of image frame text features matching image frame features, θ is the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized. Loss_MTF is the video frame mask loss, E_{V~D}[·] is the expectation over the random distribution D inside a mini-batch of video samples, V represents the image frame features, V_m is the target (masked) image frame and O(V_m) is the target image frame feature, V̂ denotes the image frame features of the video sample that are not masked, T represents the image frame text features corresponding to V̂, k indexes the k-th masked image frame feature inside the mini-batch, K is the number of images masked inside the mini-batch, and model(·) represents the prediction result.
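Assuming that the similarity θ is cosine similarity and that the i-th image frame text feature in a batch matches the i-th image frame feature, the frame-text contrast loss above reduces to a temperature-scaled contrastive objective, which might be sketched as follows:

```python
import torch
import torch.nn.functional as F

def frame_text_contrast_loss(Z, T, tau=0.07):
    # Z: (N, d) image frame features, T: (N, d) image frame text features,
    # where the i-th row of T matches the i-th row of Z (assumed pairing).
    Z = F.normalize(Z, dim=-1)
    T = F.normalize(T, dim=-1)
    logits = Z @ T.t() / tau                 # theta(Z_i, T_j) / tau for all pairs
    targets = torch.arange(Z.size(0), device=Z.device)
    return F.cross_entropy(logits, targets)  # -mean_i log softmax over the matching pair
```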
From the above, the image frame-image frame text loss in the embodiment is to complete alignment and learning among modes from three different layers, which is helpful for the video adapter to learn the fine-grained correspondence between the image frames and the text, improves the performance of semantic understanding and cross-mode tasks, and enhances the representation capability of the video language model on multi-mode data. The masked image frames are further predicted through other frames, corresponding texts of other frames and corresponding texts of mask frames, so that the video adapter can learn richer modal representations, and the combined understanding capability of the image frames and the texts is improved.
The structure of the video adapter according to the above embodiment is not limited, and based on the above embodiment, as shown in fig. 4, the present invention also provides an exemplary structure of the video adapter, which may include the following:
the video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer; the video input layer is used for receiving the joint characteristics of the visual characteristics and the frame visual information; a parameter encoder layer for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion treatment on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; repeating for multiple times until reaching a second preset repetition time, such as M2 times, wherein M2 can be 300, obtaining video visual information, and outputting the video visual information by the video output layer.
In this embodiment, for convenience of representation, the parameter features corresponding to the video parameters to be learned may be defined as video parameter features, and the text semantic features corresponding to the video description text label may be defined as video text features. The video parameter features are the features output after the text encoder of the target visual language pre-training model encodes the video parameters to be learned, and the video text features are the features output after the text encoder of the target visual language pre-training model encodes the video description text label; the video parameters to be learned and the video description text label can be encoded in the manner corresponding to the model type adopted by the target visual language pre-training model. Similarly, the text encoder of the target visual language pre-training model may also preset an attention mask, and the value of the attention mask indicates which part of the input is encoded. Correspondingly, the text encoder of the target visual language pre-training model can be used to encode the video parameters to be learned based on the current attention mask to obtain the video parameter features, and to extract the video text features of the video description text label. Taking the target visual language pre-training model being CLIP as an example, the video parameters to be learned and the video text description label are encoded by CLIP Text Encoder: the input of CLIP Text Encoder may be recorded as [vlc, text_v], where vlc denotes the video parameters to be learned and text_v the video description text label, and the output of CLIP Text Encoder may be recorded as [q_1, ..., q_c, s_1, ..., s_m], where the encodings of the first c video parameters to be learned, i.e. the video parameter features, are denoted Q = [q_1, ..., q_c], and the latter m entries are the video text features, denoted S = [s_1, ..., s_m]. The attention mask here sets the corresponding values in the third way described above.
The parameter encoder layer can employ any self-attention mechanism to encode the video parameter features Q; correspondingly, the parameter encoder layer may be denoted a self-attention mechanism layer, as shown in fig. 5, and the video parameter encoding feature, representing the output of the parameter encoder layer, may be defined as Q1 = SelfAttention(Q), where SelfAttention represents any self-attention mechanism, for example MultiHeadAttention, whose input and output dimensions are the same and whose number of heads can be set to 6. The feature fusion layer is used for learning visual information relevant to the input video parameter encoding features and can adopt any attention mechanism for this learning, such as a self-attention mechanism or a cross-attention mechanism, which does not affect the implementation of the invention. In order to further improve the precision of the extracted fusion features and accelerate the convergence speed of the video adapter, convergence of the model can be accelerated through layer normalization, while the generalization capability is further improved through a residual structure. Correspondingly, the feature fusion layer can learn the visual information relevant to the video parameter encoding features by adopting a cross-attention mechanism; that is, the feature fusion layer can adopt a cross-attention mechanism layer and can include a first video feature enhancement layer, a cross-modal learning layer and a second video feature enhancement layer. The first video feature enhancement layer is used for performing residual connection of the video parameter encoding features and the video parameter features and applying layer normalization to obtain the parameter enhancement features; the cross-modal learning layer is used for taking the parameter enhancement features as the query vectors and the joint features as a group of value vectors and key vectors, and fusing the video parameter encoding features and the joint features based on a cross-modal attention mechanism to obtain multi-modal fusion features; the second video feature enhancement layer is used for performing residual connection of the multi-modal fusion features and applying layer normalization to obtain the fusion processing result. The feature extraction layer may be any model structure capable of extracting deep features, for example a feedforward neural network, a fully-connected neural network layer or a video feature extraction layer, where the feedforward neural network first maps the data to a high-dimensional space and then, through a linear transformation, to a low-dimensional space, thereby extracting deeper features. For example, residual connection and layer normalization are performed to obtain G1 = LayerNorm(Q1 + β0 · Q), where β0 is a residual coefficient, and visual information relevant to the video learning parameters is learned through the cross-modal cross-attention mechanism, namely G2 = CrossAttention(G1, F_joint, F_joint), where F_joint is the joint feature. Residual connection and layer normalization are performed again to obtain G3 = LayerNorm(G2 + β1 · G1), where β1 is a residual coefficient. A feed forward operation is then applied to the features, i.e. G4 = FeedForward(G3). G4 is taken as the input of the corresponding step of the parameter encoder layer, and the above process is repeated M2 times to obtain the video visual information of the video, in which the video parameters to be learned have acquired the text-related information.
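A corresponding sketch of one parameter-encoder/feature-fusion iteration of the video adapter is given below; the name VideoAdapterBlock, the GELU activation, the 4x hidden width and the default residual coefficients are again assumptions of the sketch.

```python
import torch
import torch.nn as nn

class VideoAdapterBlock(nn.Module):
    """Illustrative sketch of one parameter-encoder / feature-fusion iteration."""
    def __init__(self, dim, heads=6, beta0=1.0, beta1=1.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.beta0, self.beta1 = beta0, beta1

    def forward(self, Q, F_joint):
        # Q: video parameter features, F_joint: joint feature (frame info + global video info)
        Q1, _ = self.self_attn(Q, Q, Q)                   # parameter encoder layer
        G1 = self.norm1(Q1 + self.beta0 * Q)              # first video feature enhancement layer
        G2, _ = self.cross_attn(G1, F_joint, F_joint)     # cross-modal learning layer
        G3 = self.norm2(G2 + self.beta1 * G1)             # second video feature enhancement layer
        return self.ffn(G3)                               # feature extraction layer
```

Iterating this block, with its output fed back to the parameter encoder layer until the second preset number of repetitions (for example M2 = 300) is reached, yields the video visual information.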
Based on the above embodiments, it can be appreciated that the video frame adapter can obtain text-related visual information through the frame parameters to be learned; this information is beneficial to the visual information extraction of the whole video, but at the same time an attention deviation occurs, resulting in a lack of visual information about the video as a whole. To further enhance the performance of the final video language model, this embodiment may, in addition to the original frame visual information, encode again the visual features output by the image encoder of the target visual language pre-training model to obtain the visual information of the whole video, and then fuse and extract the complete visual information of the video to compensate for the information loss of the video frame adapter caused by the attention deviation, which may include the following contents:
As shown in fig. 5, the video language model further includes a docking network layer; the docking network layer includes a first converter model, a video feature extraction layer and a joint layer. The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features from the visual fusion features and converting the dimension of the extracted features into the same dimension as the input dimension of the video adapter; and the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting the joint features to the video adapter. In this embodiment, the first converter model may be a Transformer; for the visual features v, the visual information can be fused using the Transformer self-attention mechanism to obtain vf = TF(v), where TF is any Transformer or any encoder model, for example the simplest MultiHeadAttention (multi-head attention mechanism) can be used, whose input and output dimensions are the same and whose number of heads may be set to 3. The video information is then obtained through the video feature extraction layer, which is a combination of a linear rectification function and a fully connected layer, such as a multi-layer perceptron MLP, i.e. vg = MLP(vf). Richer visual information of the video can be obtained through further abstraction and combination of the visual semantic information by the video feature extraction layer, while vg is mapped to the same dimension as the output of the video frame adapter. With the result of the video frame adapter processing, i.e. the frame visual information related to the video frame text, denoted F_frame, the joint layer joins the frame visual information F_frame and the global video information vg, and the joint feature can be defined as F_joint = [F_frame, vg].
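A sketch of the docking network layer under these assumptions is given below; the class and dimension names are illustrative, and the frame visual information is assumed to already have the video adapter's input dimension.

```python
import torch
import torch.nn as nn

class DockingNetwork(nn.Module):
    """Illustrative sketch of the docking network layer (fig. 5)."""
    def __init__(self, vis_dim, adapter_dim, heads=3):
        super().__init__()
        self.tf = nn.MultiheadAttention(vis_dim, heads, batch_first=True)       # first converter model
        self.mlp = nn.Sequential(nn.Linear(vis_dim, adapter_dim), nn.ReLU(),
                                 nn.Linear(adapter_dim, adapter_dim))            # video feature extraction layer

    def forward(self, v, frame_info):
        # v: per-frame visual features from the image encoder,
        # frame_info: frame visual information from the video frame adapter
        vf, _ = self.tf(v, v, v)                    # self-attention fusion over frames
        vg = self.mlp(vf)                           # global video information, adapter dimension
        return torch.cat([frame_info, vg], dim=1)   # joint layer: joint feature for the video adapter
```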
As can be seen from the above, in this embodiment the video adapter integrates the frame text visual information with the global video information, so the information loss caused by the attention deviation of the frame text visual information can be compensated; setting layer normalization processing can accelerate the convergence speed of the video adapter, while the residual structure further improves the generalization capability, which is beneficial to improving the training efficiency and performance of the whole video language model.
Based on the model structure of the video adapter determined in any of the foregoing embodiments, this embodiment further needs to train the video adapter, and as shown in fig. 6, the training process of the video adapter may include the following:
Extracting video characteristics of video visual information; extracting coding text features corresponding to the video text features; and carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
In this embodiment, the network architecture interfacing with the video adapter may be used to extract the video text features and the features of the video visual information. A fully connected layer is used to obtain, from the video visual information, what may be defined as the video feature G, i.e. G = FC(video visual information); the input dimension of the fully connected layer is the dimension of the video visual information and its output dimension is the same as that of the encoded text feature. A feedforward neural network can be used to extract the features of the video text features, which may be defined as the encoded text feature W, i.e. W = FFN(video text features); the feedforward neural network input dimension here is the dimension of the video text features and its output dimension is the same as that of the video feature.
The above embodiments do not limit how to iteratively update the video adapter according to the loss information between the video feature and the encoded text feature, i.e. the loss function employed by the video adapter. Of course, as a simple implementation, the loss function of the video adapter may be based directly on a mean square error or cross entropy error or other related art loss function. In order to improve the performance of the video adapter, this embodiment also provides a determination manner of the loss function of the video adapter, which may include the following:
The video-text loss calculation relation is stored locally in advance and is called to calculate the video-text loss of the video adapter; the video-text loss calculation relation can be expressed as:

Loss_G = -(1/N_G) · Σ_{i'} log[ exp(θ(G_{i'}, W_{i'})/τ) / Σ_{j'} exp(θ(G_{i'}, W_{j'})/τ) ];

where Loss_G is the video-text loss, N_G is the total number of matched video feature and encoded text feature pairs in the current batch, G_{i'} is the i'-th video feature in the current batch, W_{i'} is the encoded text feature matching the i'-th video feature, W_{j'} is the j'-th encoded text feature that does not match the i'-th video feature, θ represents the similarity between a video feature and an encoded text feature, and τ is a parameter to be optimized. For the calculation relations involved in the present invention, since the base of the logarithm is a fixed number it does not affect the model training process; the base may therefore be omitted, and those skilled in the art may select the required base according to the actual situation, which does not affect the implementation of the present invention.
From the above, the video adapter with better performance can be trained by determining the loss function based on the visual characteristics and the text characterization similarity, and the performance of the video language model is improved.
The above embodiment does not limit how to process the input video sample, the preset video parameter to be learned and the frame parameter to be learned by using the target visual language pre-training model, and based on this, the present embodiment further provides an exemplary implementation, which may include the following:
Performing image sampling processing on the video samples to obtain multi-frame sample images; extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features; extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model; and respectively extracting parameter characteristics corresponding to the video parameters to be learned and the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
In order to enable the image encoder of the target visual language pre-training model to output visual features quickly, a video frame disassembly operation can be performed, i.e., the continuous video stream is converted into individual image frames; a fixed time interval can be set in the time dimension, and N frames are selected by uniform sampling. The result for a single video sample may be recorded as V = {V_1, V_2, ..., V_N}, where N is the number of sampled frames and V_i is the i-th frame extracted from the current video.
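A possible sketch of the frame disassembly and uniform sampling step, using OpenCV to read frames; the frame count N and the decoding details are illustrative only.

```python
import cv2
import numpy as np

def sample_frames(video_path, n_frames=8):
    """Uniformly sample n_frames frames from a video file (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)   # fixed, even spacing
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)          # V_i: the i-th extracted frame
    cap.release()
    return frames
```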
In order to extract visual features with rich semantics and achieve effective representation of image visual information and cross-modal learning, the process of extracting the image features of each frame of sample image by the image encoder of the target visual language pre-training model may include: dividing the current frame image into a plurality of image blocks with non-overlapping contents; converting each image block into a one-dimensional representation through linear mapping and adding position coding information to the corresponding image block; and inputting the linearly mapped and position-encoded image blocks into the encoder of the second converter model, and extracting features from the output of the encoder of the second converter model to obtain the visual features of the video sample. The final result can be represented by the video frame features, i.e. the visual features, extracted by the image encoder of the target visual language pre-training model, recorded as v = {v_1, v_2, ..., v_N}, where v_i is the video feature of the i-th frame. Taking the target visual language pre-training model being CLIP as an example, the image visual features are extracted by the Vision Transformer (ViT, the visual encoder): the input image is divided into non-overlapping tiles of size 16x16, which represent local areas of the image. Each tile is linearly mapped and converted from a two-dimensional representation into a one-dimensional representation. At the same time, a position code is added to each tile to capture its spatial position information in the image. The linearly mapped and position-encoded tile sequence passes through the encoder of a 12-layer Transformer. The Transformer encoder uses a self-attention mechanism to integrate context information in the tile sequence, model the global information of the image, and facilitate interactions and fusion between different tiles. At the output of the Transformer encoder, feature extraction may be performed using a multi-layer perceptron in order to extract image features and pass feature information between different locations to capture details and semantic associations in the image. Through the above steps, visual features with rich semantics can be extracted from the image; these features capture the content and semantic information of the image for cross-modal matching and learning with the text input. The extracted image feature dimension is 197x768, where 197 represents the length of the tile sequence and 768 is the feature dimension of a tile.
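The tile splitting, linear mapping and position coding described above may be sketched as follows, assuming a 224x224 input so that 14x14 = 196 tiles plus one class token give the 197x768 feature mentioned above; this is a generic ViT-style sketch rather than the CLIP implementation itself.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of the tile splitting + linear mapping + position coding step."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # linear map per 16x16 tile
        n_tiles = (img_size // patch) ** 2                              # 14 * 14 = 196 tiles
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_tiles + 1, dim))       # position codes (197 tokens)

    def forward(self, x):                                    # x: (B, 3, 224, 224) frame image
        t = self.proj(x).flatten(2).transpose(1, 2)          # (B, 196, 768)
        t = torch.cat([self.cls.expand(x.size(0), -1, -1), t], dim=1)
        return t + self.pos                                  # (B, 197, 768), fed to the 12-layer encoder
```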
From the above, through the video frame splitting and the above-mentioned visual feature extraction mode, the visual features with rich semantics can be extracted, the capability of effectively representing the visual information of the image and cross-modal learning is realized, and the performance of the video language model is effectively improved.
Furthermore, considering video language data or video language tasks, weak correlation between video and text can cause more difficult semantic understanding and modal correlation of a video language model, and if end-to-end direct training is adopted, model convergence can be slow, so the invention also provides a training mode of the video language model, which can comprise the following contents:
A video frame description text label, the frame parameters to be learned and the video sample data set are taken as inputs, the image encoder of the target visual language pre-training model is frozen, and the video frame adapter is trained by using the frame parameters to be learned to acquire the visual information corresponding to the video frame description text label; when the training of the video frame adapter is completed, the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set are taken as inputs to train the video adapter. Of course, before model training, the relevant parameters of model training need to be preset, including but not limited to the dimension of the frame parameters to be learned, such as C being 256 or 768, the number of training rounds, the learning rate, the optimizer, the frame disassembly interval, the size of the image input to the image encoder, and the image enhancement method. On the basis of completing the video frame adaptation, when training the video adapter the learning rate of the video frame adapter can be lowered; that is, when a learning-rate adjustment instruction is received, the current learning rate is updated according to the new learning rate of the learning-rate adjustment instruction, the new learning rate being smaller than the current learning rate, such as reducing the previous 3e-3 to 5e-4. When training the video adapter, besides the video part, the video frame text description label, the video text description label, the frame parameters to be learned and the video parameters to be learned are input, i.e. the current input is (V, text_v, vlc, text_f, flc), where text_v is the video text description label, vlc denotes the video parameters to be learned, text_f is the video frame text description label and flc denotes the frame parameters to be learned. When the video adapter is trained, combining the frame parameters to be learned and the video parameters to be learned can achieve fine-grained alignment across time scales and improve the model's ability to understand and express the global and local features of the video. By combining the video frame adapter and the video adapter, feature extraction and adaptation are completed at the video frame level and the video level respectively, and the visual information of different layers in the video, including local details and global semantics, can be fully utilized, thereby improving the model's visual understanding of the video.
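The two-stage schedule described above might be sketched as follows; the attribute names (image_encoder, frame_adapter, video_adapter), the choice of AdamW and the way the lowered learning rate is applied through parameter groups are assumptions of this sketch.

```python
import torch

def two_stage_training(model, frame_loader, video_loader,
                       lr_stage1=3e-3, lr_stage2=5e-4, epochs=100):
    for p in model.image_encoder.parameters():
        p.requires_grad = False                      # freeze the image encoder

    # Stage 1: train the video frame adapter with the frame parameters to be learned
    opt1 = torch.optim.AdamW(model.frame_adapter.parameters(), lr=lr_stage1)
    for _ in range(epochs):
        for batch in frame_loader:
            loss = model.frame_adapter_loss(batch)   # Loss_frame, defined below
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: lower the frame adapter learning rate and train the video adapter
    opt2 = torch.optim.AdamW([
        {"params": model.frame_adapter.parameters(), "lr": lr_stage2},   # 3e-3 -> 5e-4
        {"params": model.video_adapter.parameters(), "lr": lr_stage1},   # assumed rate
    ])
    for _ in range(epochs):
        for batch in video_loader:
            loss = model.video_language_loss(batch)  # Loss, defined below
            opt2.zero_grad(); loss.backward(); opt2.step()
```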
Furthermore, in order to improve the model training efficiency, the video frame adapter loss function and the video language loss function can be stored in advance, and can be directly called during training. When training the video frame adapter, the loss function of the video frame adapter can be called to calculate the loss function of the video frame adapter; when training the video adapter, the video language loss function can be called to calculate the loss function of the video adapter. Wherein the video frame adapter loss function is:
Loss_frame = α_0 · Loss_ITM + α_1 · Loss_ITC + α_2 · Loss_ITG + β · Loss_MTF;
the video language loss function is:
Loss = α · Loss_frame + γ · Loss_G;
Where Loss represents the video language loss function, Loss_frame represents the video frame adapter loss function, Loss_ITM represents the frame-text matching loss, Loss_ITC represents the text generation loss, Loss_ITG represents the frame-text contrast loss, Loss_MTF represents the video frame mask loss, α_0 represents the frame-text matching loss coefficient, α_1 represents the text generation loss coefficient, α_2 represents the frame-text contrast loss coefficient, and β represents the video frame mask loss coefficient. α is the video frame adapter loss function coefficient, Loss_G is the video-text loss, and γ is the video-text loss coefficient.
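A minimal sketch of the weighted combinations above; the individual loss terms are assumed to be computed elsewhere, and the coefficient values are hyperparameters chosen by the practitioner rather than values fixed by the embodiment.

```python
def frame_adapter_loss(loss_itm, loss_itc, loss_itg, loss_mtf,
                       a0=1.0, a1=1.0, a2=1.0, beta=1.0):
    # Loss_frame = a0*Loss_ITM + a1*Loss_ITC + a2*Loss_ITG + beta*Loss_MTF
    return a0 * loss_itm + a1 * loss_itc + a2 * loss_itg + beta * loss_mtf

def video_language_loss(loss_frame, loss_g, alpha=1.0, gamma=1.0):
    # Loss = alpha*Loss_frame + gamma*Loss_G
    return alpha * loss_frame + gamma * loss_g
```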
As can be seen from the foregoing, in this embodiment the video frame adapter of the visual language model is trained first, and the video adapter is then trained while the learning strength of the video frame adapter is reduced, thereby completing the training of the whole video language pre-training model; this can improve the semantic understanding capability of the video language model, improve its convergence speed, and further improve the training speed and performance of the video language model.
It will be appreciated that in the video language dependent task processing process, the pre-training model includes a pre-training process and a fine tuning process, the fine tuning process being a process of applying the pre-trained model to the data set of the current downstream video language application task and adapting the model parameters to the own data set. The embodiment trains and obtains the video language model suitable for the video language task, the video language model is a pre-trained large model with very strong generalization capability, and the embodiment carries out fine adjustment on the pre-trained large model to obtain the video language model for executing a certain appointed video language task. Based on this, the present invention also provides a method for executing video language tasks, please refer to fig. 7, which may include the following contents:
S701: training a video language model;
S702: acquiring a video language task to be executed and a corresponding video language task sample set;
S703: based on the video language task, fine-tuning the video language model by utilizing the video language task sample set;
S704: and executing the video language task by utilizing the fine-tuned video language model.
In the pre-training stage, the training task is generally performed on the basis of a large-scale corpus, and a large-scale neural network algorithm structure is trained for learning a specific language model; the finally obtained large-scale neural network algorithm structure and parameters constitute the pre-training model, namely the video language model described in this embodiment. In this embodiment, the video language model is obtained by training using the video language model training method described in any one of the previous embodiments. In the fine-tuning stage, small-scale training is performed on specific task targets (downstream tasks) and task data (downstream data), so that the parameters of the video language model are slightly adjusted, finally obtaining a video language model adapted to the specific task and data. In this embodiment, a video language task to be executed and a corresponding video language task training sample set are acquired; based on the video language task, the video language model is fine-tuned by utilizing the video language task training sample set; the downstream task is the video language task to be executed, and the task data is the video language task training sample set. Finally, the video language task is executed by utilizing the fine-tuned video language model.
Among them, the video language tasks to be performed include, but are not limited to, video content understanding and classifying tasks, video subtitle generating and translating tasks, video question and answer tasks, video summary and highlight generating tasks, and video retrieving and recommending tasks. The task of understanding and classifying video content refers to that video content is understood by using a video language model and classified into different categories, such as films, sporting events, news stories and the like, such as video classification and video library content management. The video subtitle generation and translation task is to understand video content and dialogue by using a video language model, automatically generate subtitles, and even perform multi-language translation. Such as automatically generating subtitles for a movie or television program, accessing cross-language video content. Video question-answering tasks refer to the use of a video language model to understand video content and answer questions about the video, such as interactive learning on an educational platform, and automatic question answering in customer services. The video abstraction and highlight generation task is to automatically identify key moments in video by using a video language model and generate abstracts or highlight fragments, which are suitable for quick browsing of long video contents. Such as highlight playback of a sporting event, and a summary of the critical content of the meeting video. The video retrieval and recommendation task refers to improving the accuracy and relevance of video search, such as search and recommendation of an online video platform and video retrieval of a digital library, by understanding video content and user query. The video language task training sample set is a training sample data set corresponding to the video language task to be executed, namely a training sample set used by the pre-training model in the fine tuning process applicable to the video language task to be executed, taking the video language task to be executed as a video question-answering task as an example, and the video language task training sample set comprises a plurality of video samples of different types, wherein each video sample is marked with corresponding questions and corresponding answers through manual or automatic labels in advance. Taking a video language task to be executed as a video content understanding task as an example, the video language task training sample set is a video sample set of a plurality of video samples carrying video content labels, namely the video language task training sample set is a video content understanding task training sample set, the video content understanding task training sample set comprises a plurality of video samples of different types, each video sample carries a label corresponding to video content, so that a video language model learns the video content, and automatically understands and measures the video content of an input video.
As can be seen from the foregoing, in this embodiment, the pre-training model is obtained by training the model training method described in the foregoing embodiment, and then the parameters of the video language model are fine-tuned by the downstream video language task to be applied, so as to obtain a video language model capable of executing the downstream video language task, which is beneficial to improving the execution efficiency and execution precision of the video language task, and meets the execution requirement of the user on the video language related task.
It should be noted that, in the present invention, the steps are not strictly executed sequentially, so long as they conform to the logic sequence, and the steps may be executed simultaneously or according to a certain preset sequence, and fig. 1 and fig. 7 are only schematic, and do not represent only such an execution sequence.
Finally, based on the above technical solution of the present invention, the following description will exemplify some possible application scenarios related to the technical solution of the present invention with reference to fig. 8, and fig. 8 is a schematic diagram of a hardware composition framework to which the video language task execution method provided by the present invention is applicable, where the following may be included:
The hardware component framework may include a first electronic device 81 and a second electronic device 82, where the first electronic device 81 and the second electronic device 82 are connected through a network 83. The first electronic device 81 deploys a processor for executing the video language model training method described in any of the above embodiments, and transmits the trained video language model to the second electronic device 82. The second electronic device 82 deploys an interface for providing human-computer interaction, stores the video language model after the pre-training stage, and, when receiving a video question-answering task, acquires a training sample set corresponding to the video question-answering task; based on the video question-answering task, the video language model is fine-tuned by utilizing the video question-answering task sample set; and the video question-answering task is executed by using the fine-tuned video language model.
The first electronic device 81 completes all or part of the steps in the training of the video language model according to the above embodiment, and the built video language model is shown in fig. 9, where the video language model includes the image encoder and text encoder of the target visual language pre-training model, the video frame adapter, the video adapter, and a video frame disassembly module. Its inputs include the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set. The target visual language pre-training model is used for extracting the visual features, the frame parameter features and the video parameter features, which are correspondingly input into the video frame adapter and the video adapter; the video frame adapter is used for converting the visual features into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting the video visual information.
It should be noted that the above application scenario is only shown for the convenience of understanding the idea and principle of the present invention, and the embodiment of the present invention is not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
As can be seen from the above, the embodiment of the present invention can improve the execution efficiency and execution accuracy of the video question-answering task and satisfy the user's execution requirements for the video question-answering task.
The invention also provides a corresponding device for the video language model training method and the video language task execution method, so that the method has more practicability. Wherein the device may be described separately from the functional module and the hardware. In this embodiment, the video language model training device and the video language task execution device may include or be divided into one or more program modules, where the one or more program modules are stored in a storage medium and executed by one or more processors, to complete the video language model training method and the video language task execution method according to the first embodiment of the present disclosure. Program modules in the present embodiment refer to a series of computer program instruction segments capable of performing a specific function, and are more suitable than programs themselves for describing the execution of the video language model training apparatus and the video language task execution apparatus in a storage medium. The following description will specifically describe the functions of each program module of the present embodiment, and the video language model training device and the video language task execution device described below may be referred to correspondingly to the corresponding video language model training method and video language task execution method described above.
Based on the angles of the functional modules, referring to fig. 10, fig. 10 is a block diagram of a video language model training device provided in this embodiment under a specific implementation manner, where the device may include:
The data acquisition module 101 is configured to acquire a video sample data set carrying a text description tag, a preset video parameter to be learned, and a preset frame parameter to be learned;
An input data processing module 102, configured to input a video sample in the video sample data set, a video parameter to be learned, and a frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
the model parameter updating module 103 is configured to iteratively update the video language model according to the frame visual information, the video visual information, and the loss information of the text semantic features until a preset model training end condition is satisfied.
Illustratively, in some implementations of the present embodiment, the video frame adapter may include a frame input layer, a text encoding layer, a cross-modal fusion layer, a feature enhancement layer, and a frame output layer;
The frame input layer is used for receiving the splicing result of the frame parameter features and the video frame text features; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding features and frame text coding features; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting the enhanced features into the text coding layer; this process is repeated until a first preset number of repetitions is reached, so as to obtain the frame visual information; and the frame output layer is used for outputting the frame visual information.
In some exemplary implementations of the present embodiment, the cross-modal fusion layer is a cross-modal attention mechanism layer, and the cross-modal fusion layer is configured to take the frame parameter coding features as query vectors and the visual features as a set of value vectors and key vectors, and to encode the frame parameter coding features and the visual features based on the cross-modal attention mechanism to obtain the fusion result.
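For ease of understanding only, and not as a limitation of the embodiment, the cross-modal attention fusion described above may be sketched in PyTorch-style Python; the class name, feature dimension, and number of attention heads are illustrative assumptions rather than values given in this disclosure.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the cross-modal fusion step: frame parameter coding features
    act as queries, while the visual features supply the keys and values."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_param_enc: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # frame_param_enc: (batch, num_params, dim)  -> query vectors
        # visual_feats:    (batch, num_patches, dim) -> key/value vectors
        fused, _ = self.attn(query=frame_param_enc, key=visual_feats, value=visual_feats)
        return fused  # fusion result passed on to the feature enhancement layer
```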
In other exemplary implementations of the present embodiment, the feature enhancement layer includes a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer; the first feature enhancement layer is used for carrying out layer normalization processing on the fusion result and obtaining a first interaction enhancement feature through a residual connection; the interaction feature extraction layer is used for extracting features of the first interaction enhancement feature to obtain a second interaction enhancement feature; and the second feature enhancement layer is used for carrying out layer normalization processing on the second interaction enhancement feature and applying a residual connection.
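As an illustrative aid only, the feature enhancement block may be sketched as follows; the use of a feed-forward network for the interaction feature extraction sub-layer and the exact ordering of normalization and residual addition are assumptions, since the disclosure does not fix these internals.

```python
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Sketch of the feature enhancement block: layer normalization and residual
    connections wrapped around an interaction feature extraction sub-layer
    (modelled here as a feed-forward network, which is an assumption)."""
    def __init__(self, dim: int = 768, hidden: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.extract = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, fusion_result: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        # first feature enhancement: layer normalization plus a residual connection
        first_enhanced = self.norm1(fusion_result) + residual
        # interaction feature extraction on the first interaction enhancement feature
        second_enhanced = self.extract(first_enhanced)
        # second feature enhancement: layer normalization plus a residual connection
        return self.norm2(second_enhanced) + first_enhanced
```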
In other exemplary implementations of the present embodiment, the model parameter updating module 103 includes a video frame adapter training module, where the video frame adapter training module is configured to:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image; extracting frame text coding features corresponding to the current frame to obtain image frame text features corresponding to the current frame image; and carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
Determining a frame-text matching loss by predicting, with the video frame adapter, whether the image frame features and the image frame text features are positively matched or negatively unmatched; determining a frame-text contrast loss by comparing the similarity between the image frame features and the image frame text features; masking off part of the video frame text features, predicting the masked video frame text features with the video frame adapter trained on the image frame text features corresponding to the remaining video frame text features and the image frame features, and determining a text generation loss; and determining a loss function for the video frame adapter based on the frame-text matching loss, the frame-text contrast loss, and the text generation loss.
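Purely as an illustration of the frame-text matching term, the matched/unmatched prediction can be treated as a binary classification; the matching head producing the logits is an assumption and is not specified in this disclosure.

```python
import torch
import torch.nn.functional as F

def frame_text_matching_loss(pair_logits: torch.Tensor, is_match: torch.Tensor) -> torch.Tensor:
    """Illustrative frame-text matching loss: predict whether an image frame
    feature and an image frame text feature form a positive (matched) or
    negative (unmatched) pair."""
    # pair_logits: (num_pairs, 2) scores from an assumed matching head
    # is_match:    (num_pairs,)   1 for positive pairs, 0 for negative pairs
    return F.cross_entropy(pair_logits, is_match.long())
```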
As another exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
The frame-text contrast loss is determined by comparing the positive and negative similarities.
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
invoking a contrast loss function relation, and calculating the frame-text contrast loss; the contrast loss function relationship is:

$$\mathrm{Loss}_{ITG} = -\frac{1}{N_{ITG}}\sum_{i=1}^{N_{ITG}}\log\frac{\exp\big(\theta(Z_i, T_i)/\tau\big)}{\sum_{j}\exp\big(\theta(Z_i, T_j)/\tau\big)}$$

where Loss ITG is the frame-text contrast loss, Z i is the i-th image frame feature, T i is the image frame text feature that matches the i-th image frame feature, T j is the j-th image frame text feature that does not match the i-th image frame feature, N ITG is the total number of image frame text features that match image frame features, θ represents the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized.
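As a non-limiting sketch of how such a contrastive term can be computed, assuming cosine similarity as the similarity function θ and a batch in which the i-th frame feature matches the i-th text feature:

```python
import torch
import torch.nn.functional as F

def frame_text_contrast_loss(frame_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             tau: float = 0.07) -> torch.Tensor:
    """Illustrative frame-text contrastive loss: matched (i, i) pairs are pulled
    together and unmatched (i, j) pairs pushed apart; cosine similarity stands
    in for the similarity function and tau is the temperature parameter."""
    z = F.normalize(frame_feats, dim=-1)   # (N, d) image frame features
    t = F.normalize(text_feats, dim=-1)    # (N, d) image frame text features
    sim = z @ t.T / tau                    # (N, N) similarity matrix
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(sim, targets)   # -log softmax over the matched entries
```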
As another exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
A loss function of the video frame adapter is determined based on the image frame-to-image frame text loss and the video frame mask loss.
As an exemplary implementation of the present embodiment, the video frame adapter training module may further be configured to:
invoking a video frame mask loss function relation, and calculating the video frame mask loss; in the video frame mask loss function relationship, Loss MTF is the video frame mask loss, the expectation is taken over the random distribution D of video samples within a mini-batch, V denotes the image frame features, V m is the target image frame, O(V m) is the target image frame feature, the unmasked image frame features of the video sample together with their corresponding image frame text features T are fed to the model, the k-th masked image frame feature within the mini-batch is indexed by k, K is the number of images masked within the mini-batch, and model denotes the prediction result.
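The exact distance used between the predicted and target frame features is not reproduced here; as one plausible, purely illustrative choice, a mean-squared error over the masked frames could be used:

```python
import torch
import torch.nn.functional as F

def video_frame_mask_loss(predicted_feats: torch.Tensor,
                          target_feats: torch.Tensor) -> torch.Tensor:
    """Illustrative video frame mask loss: the adapter predicts the features of
    the masked target image frames from the unmasked frames and their text
    features; a mean-squared error is an assumed stand-in for the distance
    actually used in the disclosure."""
    # predicted_feats: (K, d) predictions for the K masked frames in the mini-batch
    # target_feats:    (K, d) target image frame features O(V_m)
    return F.mse_loss(predicted_feats, target_feats)
```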
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer;
The video input layer is used for receiving the joint features of the visual features and the frame visual information; the parameter encoder layer is used for encoding the video parameter features to obtain video parameter coding features; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features from the fusion processing result and transmitting the extracted features to the parameter encoder layer; this process is repeated until a second preset number of repetitions is reached to obtain the video visual information, and the video output layer outputs the video visual information.
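For illustration only, the repeated encode-fuse-extract cycle of the video adapter may be sketched as follows; the layer internals, feature dimension, and repetition count are assumptions and do not limit the embodiment.

```python
import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    """Sketch of the video adapter: the parameter encoder encodes the video
    parameter features, which are fused with the joint features and refined by
    a feature extraction layer; the cycle repeats a preset number of times and
    the result is output as the video visual information."""
    def __init__(self, dim: int = 768, num_repeats: int = 4):
        super().__init__()
        self.param_encoder = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.extract = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.num_repeats = num_repeats

    def forward(self, joint_feats: torch.Tensor, video_params: torch.Tensor) -> torch.Tensor:
        x = video_params
        for _ in range(self.num_repeats):                        # second preset repetition count
            x = self.param_encoder(x)                            # video parameter coding features
            fused, _ = self.fusion(x, joint_feats, joint_feats)  # fusion with the joint features
            x = self.extract(fused)                              # extracted features fed back to the encoder
        return x                                                 # video visual information
```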
Illustratively, in other implementations of the present embodiment, the feature fusion layer includes a first video feature enhancement layer, a cross-modal learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on video parameter coding features and video parameter features and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
And the second video feature enhancement layer is used for carrying out residual connection on the multi-modal fusion features and carrying out layer normalization processing to obtain the fusion processing result.
In some exemplary implementations of the present embodiments, the video language model further includes a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; and the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting the joint features to the video adapter.
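As a non-limiting sketch of the docking network layer, the self-attention fusion, dimension conversion, and joining steps could be arranged as below; the number of transformer layers and the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DockingNetwork(nn.Module):
    """Sketch of the docking network layer: a transformer fuses the visual
    features with self-attention, a projection converts them to the video
    adapter's input dimension, and the joint layer concatenates them with the
    frame visual information to form the joint features."""
    def __init__(self, vis_dim: int = 1024, adapter_dim: int = 768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(vis_dim, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=2)   # first converter model
        self.project = nn.Linear(vis_dim, adapter_dim)            # video feature extraction layer

    def forward(self, visual_feats: torch.Tensor, frame_visual_info: torch.Tensor) -> torch.Tensor:
        fused = self.fuser(visual_feats)                          # visual fusion features
        projected = self.project(fused)                           # match the video adapter input dimension
        # frame_visual_info is assumed to already have shape (batch, n, adapter_dim)
        return torch.cat([frame_visual_info, projected], dim=1)   # joint features for the video adapter
```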
In other exemplary implementations of the present embodiment, the model parameter updating module 103 includes a video adapter training module, where the video adapter training module is configured to:
Extracting video characteristics of video visual information;
extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
As an exemplary implementation of the present embodiment, the video adapter training module may further be configured to:
Invoking a video-text loss calculation relation, and calculating video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
$$\mathrm{Loss}_{G} = -\frac{1}{N_{G}}\sum_{i'=1}^{N_{G}}\log\frac{\exp\big(\theta(F_{i'}, T_{i'})/\tau\big)}{\sum_{j'}\exp\big(\theta(F_{i'}, T_{j'})/\tau\big)}$$

where Loss G is the video-text loss, N G is the total number of video feature and coded text feature matches in the current batch, F i' is the i'-th video feature in the current batch, T i' is the coded text feature matching the i'-th video feature, T j' is the j'-th coded text feature that does not match the i'-th video feature, θ represents the similarity between a video feature and a coded text feature, and τ is a parameter to be optimized.
Illustratively, in other implementations of the present embodiment, the input data processing module 102 may be further configured to:
performing image sampling processing on the video samples to obtain multi-frame sample images;
extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
and respectively extracting parameter characteristics corresponding to the video parameters to be learned and the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
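For ease of understanding, the input processing described by this module may be sketched as a simple pipeline; the encoder interfaces, the uniform sampling strategy, and the number of sampled frames are assumptions introduced only for illustration.

```python
import torch

def extract_inputs(video_frames, caption, frame_params, video_params,
                   image_encoder, text_encoder, num_frames: int = 8):
    """Sketch of the input pipeline: sample frames from the video, encode each
    frame with the image encoder of the visual language pre-training model, and
    encode the text description tag and the learnable parameters with its text
    encoder."""
    # uniform image sampling over the video sample
    idx = torch.linspace(0, len(video_frames) - 1, num_frames).long()
    sampled = [video_frames[i] for i in idx]
    # per-frame visual features from the image encoder
    visual_feats = torch.stack([image_encoder(f) for f in sampled], dim=0)
    # text semantic features of the text description tag
    text_feats = text_encoder(caption)
    # parameter features for the frame and video parameters to be learned
    frame_param_feats = text_encoder(frame_params)
    video_param_feats = text_encoder(video_params)
    return visual_feats, text_feats, frame_param_feats, video_param_feats
```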
In some exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
Carrying out random initialization processing on frame parameters to be learned by using a text encoder of a target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
In other exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
and a text encoder of the target visual language pre-training model is utilized to encode the video description text label based on the current attention mask, so that video text characteristics are obtained.
In other exemplary implementations of the present embodiment, the input data processing module 102 may be further configured to:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of the second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
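The patch-splitting, linear mapping, and position coding steps above can be sketched as follows; the image size, patch size, and embedding dimension are illustrative assumptions, and a strided convolution is used only as a convenient way to cut and linearly map non-overlapping blocks.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of the patch pipeline: split the current frame image into
    non-overlapping blocks, map each block to a one-dimensional embedding, and
    add position coding before the transformer encoder."""
    def __init__(self, img_size: int = 224, patch: int = 16, dim: int = 768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # cut + linear mapping
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, 3, H, W) current frame image
        x = self.proj(frame).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        return x + self.pos_embed                         # input to the second converter model's encoder
```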
Illustratively, in other implementations of the present embodiment, the model parameter updating module 103 may further be configured to:
Taking a video frame description text label, frame parameters to be learned and a video sample data set as inputs, and training a video frame adapter by freezing an image encoder of a target visual language pre-training model and utilizing the frame parameters to be learned to acquire visual information corresponding to the video frame description text label;
When training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
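Purely as an illustration of this two-stage schedule with a frozen image encoder, a training loop might be organized as below; the module names, loss callables, optimizer, and learning rate are placeholders, not part of the disclosed method.

```python
import torch

def train_two_stages(model, frame_loader, video_loader, frame_loss_fn, video_loss_fn,
                     lr: float = 1e-4, epochs: int = 1):
    """Sketch of the two-stage schedule: the image encoder stays frozen, the
    video frame adapter is trained first, and the video adapter is trained once
    the frame adapter training is completed."""
    for p in model.image_encoder.parameters():      # freeze the image encoder
        p.requires_grad = False

    # stage 1: train the video frame adapter with frame description labels and frame parameters
    opt = torch.optim.AdamW(model.video_frame_adapter.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in frame_loader:
            loss = frame_loss_fn(model, batch)
            opt.zero_grad(); loss.backward(); opt.step()

    # stage 2: train the video adapter with both the frame and video parameters to be learned
    opt = torch.optim.AdamW(model.video_adapter.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in video_loader:
            loss = video_loss_fn(model, batch)
            opt.zero_grad(); loss.backward(); opt.step()
```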
In some exemplary implementations of the present embodiment, the model parameter updating module 103 may further be configured to:
When a learning rate adjustment instruction is received, updating the current learning rate according to the new learning rate of the learning rate adjustment instruction; the new learning rate is less than the current learning rate.
In other exemplary implementations of the present embodiment, the model parameter updating module 103 may be further configured to:
Invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
$$\mathrm{Loss}_{frame} = \alpha_0\,\mathrm{Loss}_{ITM} + \alpha_1\,\mathrm{Loss}_{ITC} + \alpha_2\,\mathrm{Loss}_{ITG} + \beta\,\mathrm{Loss}_{MEF}$$

Where Loss frame represents the video frame adapter loss function, Loss ITM is the frame-text matching loss, Loss ITC is the text generation loss, Loss ITG is the frame-text contrast loss, Loss MEF is the video frame mask loss, α 0 is the frame-text matching loss coefficient, α 1 is the text generation loss coefficient, α 2 is the frame-text contrast loss coefficient, and β is the video frame mask loss coefficient.
In other exemplary implementations of the present embodiment, the model parameter updating module 103 may be further configured to:
Calling a video language loss function, and training a video adapter; the video language loss function is:
$$\mathrm{Loss} = \alpha\,\mathrm{Loss}_{frame} + \gamma\,\mathrm{Loss}_{G}$$

where Loss represents the video language loss function, Loss frame is the video frame adapter loss function, α is the video frame adapter loss function coefficient, Loss G is the video-text loss, and γ is the video-text loss coefficient.
From the perspective of functional modules, referring to fig. 11, fig. 11 is a block diagram of a video language task execution device provided in this embodiment under a specific implementation manner, where the device may include:
the model training module 111 is used for training to obtain a video language model;
the data acquisition module 112 is configured to acquire a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module 113 is configured to fine tune the video language model based on the video language task by using the video language task sample set;
The task execution module 114 is configured to execute the video language task using the trimmed video language model.
The functions of each functional module of the video language model training device and the video language task execution device in this embodiment may be specifically implemented according to the method in the above method embodiment, and the specific implementation process may refer to the related description of the above method embodiment, which is not repeated herein.
From the above, the training efficiency of the video language model can be effectively improved, and the computing resources required by model training can be saved.
The video language model training device and the video language task execution device mentioned above are described from the perspective of functional modules, and further, the invention also provides an electronic device, which is described from the perspective of hardware. Fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 12, the electronic device comprises a memory 120 for storing a computer program; a processor 121 for implementing the steps of the video language model training method and/or the video language task execution method as mentioned in any of the above embodiments when executing a computer program.
Processor 121 may include one or more processing cores, such as a 4-core processor or an 8-core processor, and processor 121 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 121 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). Processor 121 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 121 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 121 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 120 may include one or more computer-readable storage media, which may be non-transitory. Memory 120 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, memory 120 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 120 may also be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a smart media card (SMC), a secure digital (SD) card, or a flash card. Further, the memory 120 may also include both an internal storage unit and an external storage device of the electronic device. The memory 120 may be used to store not only application software installed in the electronic device but also various types of data, such as code of programs during execution of the video language model training method and the video language task execution method; it can also be used to temporarily store data that has been output or is to be output. In this embodiment, the memory 120 is at least used for storing a computer program 1201, which, after being loaded and executed by the processor 121, can implement the relevant steps of the video language model training method and the video language task execution method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 120 may further include an operating system 1202, data 1203, and the like, and the storage manner may be transient storage or permanent storage. The operating system 1202 may include Windows, Unix, Linux, and the like. The data 1203 may include, but is not limited to, video language model training results, data corresponding to video language task execution results, and the like.
In some embodiments, the electronic device may further include a display 122, an input/output interface 123, a communication interface 124 (also referred to as a network interface), a power supply 125, and a communication bus 126. The display 122 and the input/output interface 123, such as a keyboard, belong to the user interface, which may optionally also include standard wired interfaces, wireless interfaces, and the like. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch screen, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 124 may illustratively include a wired interface and/or a wireless interface, such as a Wi-Fi interface or a Bluetooth interface, typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 126 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 12, but this does not mean there is only one bus or one type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 12 is not limiting of the electronic device and may include more or fewer components than shown, for example, may also include sensors 127 to perform various functions.
The functions of each functional module of the electronic device in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
From the above, the training efficiency of the video language model can be effectively improved, and the computing resources required by model training can be saved.
It will be appreciated that if the video language model training method and the video language task execution method in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, or an optical disk.
Based on this, the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the video language model training method and/or the video language task execution method according to any one of the embodiments above.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the hardware including the device and the electronic equipment disclosed in the embodiments, the description is relatively simple because the hardware includes the device and the electronic equipment corresponding to the method disclosed in the embodiments, and relevant places refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The video language task execution and the model training method, the device, the electronic equipment and the readable storage medium thereof provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that, based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without making any inventive effort fall within the scope of protection of the present invention. The present invention is capable of numerous modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to be within the scope of the present invention.
Claims (26)
1. A method for training a video language model, comprising:
Acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and frame parameters to be learned;
Inputting a video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information;
According to the frame visual information, the video visual information and the loss information of text semantic features, iteratively updating the video language model until a preset model training ending condition is met;
The method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned;
The image encoder of the target visual language pre-training model is used for extracting visual characteristics of a video sample, the text encoder of the target visual language pre-training model is used for extracting text semantic characteristics of a text description tag of the video sample, and extracting video parameters to be learned and parameter characteristics corresponding to the frame parameters to be learned;
The frame parameter to be learned is characterized in that the parameter characteristic corresponding to the frame parameter to be learned is a frame parameter characteristic, the text description tag comprises a video frame description text tag, the text semantic characteristic corresponding to the video frame description text tag is a video frame text characteristic, and the video frame adapter comprises a frame input layer, a text coding layer, a cross-mode fusion layer, a characteristic enhancement layer and a frame output layer; the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information;
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer, wherein the parameter features corresponding to the video parameters to be learned are video parameter features; the video input layer is used for receiving the combined characteristics of the visual characteristics and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information;
Each video sample is provided with a text description tag, the video description text tag is text data corresponding to the description of the whole video, the video frame description text tag is text data corresponding to the description of the current frame of image, and when frame visual information and video visual information corresponding to the video sample are obtained, the frame visual information and the video visual information are compared with text semantic features of the text description tag corresponding to the video sample, and model parameters of the video language model are updated by continuously reducing the difference between the frame visual information and the video visual information.
2. The video language model training method of claim 1, wherein the cross-modal fusion layer is a cross-modal attention mechanism layer, and the cross-modal fusion processing of the frame parameter coding feature and the visual feature comprises:
And taking the frame parameter coding feature as a query vector, taking the visual feature as a group of value vectors and key vectors, and coding the frame parameter coding feature and the visual feature based on a cross-modal attention mechanism to be taken as a fusion result.
3. The video language model training method of claim 1, wherein the feature enhancement layer comprises a first feature enhancement layer, an interactive feature extraction layer, and a second feature enhancement layer;
The first feature enhancement layer is used for carrying out layer normalization processing on the fusion result and obtaining a first interaction enhancement feature through a residual connection;
the interaction feature extraction layer is used for extracting features of the first interaction enhancement features to obtain second interaction enhancement features;
and the second feature enhancement layer is used for carrying out layer normalization processing on the second interaction enhancement feature and applying a residual connection.
4. The video language model training method of claim 1, wherein the training process of the video frame adapter comprises:
Extracting the characteristics of the frame visual information corresponding to the current frame to obtain the image frame characteristics corresponding to the current frame image;
extracting text features of the video frames corresponding to the current frames to obtain text features of the image frames corresponding to the current frame images;
And carrying out iterative updating on the video frame adapter according to the loss information between each image frame characteristic and the corresponding image frame text characteristic.
5. The method according to claim 4, wherein iteratively updating the video frame adapter according to loss information between each image frame feature and a corresponding image frame text feature comprises:
Determining a frame-text matching penalty by predicting whether an image frame feature and an image frame text feature are positively matched or negatively mismatched using the video frame adapter;
Determining a frame-text contrast penalty by comparing the similarity between the image frame features and the image frame text features;
Masking off part of the video frame text features, predicting the masked video frame text features through the video frame adapter trained on the image frame text features corresponding to the remaining video frame text features and the image frame features, and determining a text generation loss;
determining a penalty function for the video frame adapter based on the frame-to-text matching penalty, the frame-to-text contrast penalty, and the text generation penalty.
6. The video language model training method of claim 5, wherein said determining a frame-text contrast loss by comparing similarities between image frame features and image frame text features comprises:
Taking the image frame characteristics and the image frame text characteristics which are positively matched as a group of positive samples, and taking the image frame characteristics and the image frame text characteristics which are negatively unmatched as a group of negative samples;
Calculating positive similarity between the image frame features and the image frame text features in each group of positive samples, and calculating negative similarity between the image frame features and the image frame text features in each group of negative samples;
a frame-text contrast loss is determined by comparing the positive similarity to the negative similarity.
7. The video language model training method of claim 5, wherein said determining a frame-text contrast loss by comparing similarities between image frame features and image frame text features comprises:
Invoking a contrast loss function relation, and calculating frame-text contrast loss; the contrast loss function relationship is:
$$\mathrm{Loss}_{ITG} = -\frac{1}{N_{ITG}}\sum_{i=1}^{N_{ITG}}\log\frac{\exp\big(\theta(Z_i, T_i)/\tau\big)}{\sum_{j}\exp\big(\theta(Z_i, T_j)/\tau\big)}$$

Where Loss ITG is the frame-text contrast loss, exp is the exponential function, Z i is the i-th image frame feature, T i is the image frame text feature matching the i-th image frame feature, T j is the j-th image frame text feature not matching the i-th image frame feature, N ITG is the total number of image frame text features matching image frame features, θ is the similarity between an image frame feature and an image frame text feature, and τ is a parameter to be optimized.
8. The video language model training method of claim 5, wherein said determining a penalty function for said video frame adapter based on said frame-to-text matching penalty, said frame-to-text contrast penalty, and said text generation penalty comprises:
determining an image frame-image frame text penalty from the frame-text matching penalty, the frame-text contrast penalty, and the text generation penalty;
Masking target image frames of the video samples, predicting the target image frames through a video frame adapter trained based on image frame text features and image frame features corresponding to the masked video samples, and determining video frame mask loss;
and determining a loss function of the video frame adapter according to the image frame-image frame text loss and the video frame mask loss.
9. The video language model training method of claim 8, wherein said determining video frame mask loss comprises:
Invoking a video frame mask loss function relation, and calculating the video frame mask loss; in the video frame mask loss function relationship, Loss MTF is the video frame mask loss, the expectation is taken over the random distribution D of video samples within a mini-batch, V denotes image frame features, V m is the target image frame, O(V m) is the target image frame feature, the unmasked image frame features of the video sample together with their corresponding image frame text features T are fed to the model, the k-th masked image frame feature within the mini-batch is indexed by k, K is the number of images masked within the mini-batch, and model denotes the prediction result.
10. The video language model training method of claim 1, wherein the feature fusion layer comprises a first video feature enhancement layer, a cross-modal learning layer, and a second video feature enhancement layer;
The first video feature enhancement layer is used for carrying out residual connection on the video parameter coding features and the video parameter features, and carrying out layer normalization processing to obtain parameter enhancement features;
The cross-modal learning layer is used for carrying out fusion processing on the video parameter coding feature and the joint feature based on a cross-modal attention mechanism by taking the parameter enhancement feature as a query vector and taking the joint feature as a group of value vectors and key vectors to obtain a multi-modal fusion feature;
and the second video feature enhancement layer is used for carrying out residual connection on the multi-modal fusion features and carrying out layer normalization processing to obtain the fusion processing result.
11. The video language model training method of claim 9, wherein the video language model further comprises a docking network layer; the docking network layer comprises a first converter model, a video feature extraction layer and a joint layer;
The first converter model is used for fusing the visual features based on a self-attention mechanism to obtain visual fusion features; the video feature extraction layer is used for extracting features of the visual fusion features and converting the dimensions of the extracted features into dimensions identical to the input dimensions of the video adapter; the joint layer is used for combining the frame visual information and the output of the video feature extraction layer and inputting joint features to the video adapter.
12. The method for training a video language model according to claim 9, wherein the text description tag comprises a video description text tag, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video adapter comprises:
extracting video features of the video visual information;
Extracting coding text features corresponding to the video text features;
And carrying out iterative updating on the video adapter according to the loss information between the video characteristics and the coded text characteristics.
13. The video language model training method of claim 12, wherein said determining loss information between said video feature and said encoded text feature comprises:
invoking a video-text loss calculation relation, and calculating video-text loss of the video adapter, wherein the video-text loss calculation relation is as follows:
$$\mathrm{Loss}_{G} = -\frac{1}{N_{G}}\sum_{i'=1}^{N_{G}}\log\frac{\exp\big(\theta(F_{i'}, T_{i'})/\tau\big)}{\sum_{j'}\exp\big(\theta(F_{i'}, T_{j'})/\tau\big)}$$

where Loss G is the video-text loss, N G is the total number of matches between video features and coded text features in the current batch, F i' is the i'-th video feature in the current batch, T i' is the coded text feature matching the i'-th video feature, T j' is the j'-th coded text feature that does not match the i'-th video feature, θ represents the similarity between a video feature and a coded text feature, and τ is a parameter to be optimized.
14. The video language model training method of claim 1, wherein said inputting the video samples in the video sample dataset, the video parameters to be learned, and the frame parameters to be learned to the video language model comprises:
Performing image sampling processing on the video sample to obtain a multi-frame sample image;
Extracting image features of each frame of sample image by using an image encoder of the target visual language pre-training model to obtain visual features;
Extracting text semantic features of text description tags of the video samples by using a text encoder of the target visual language pre-training model;
And respectively extracting the video parameters to be learned and the parameter characteristics corresponding to the frame parameters to be learned by using a text encoder of the target visual language pre-training model.
15. The method for training a video language model according to claim 14, wherein the text encoder using the target visual language pre-training model extracts the parameters corresponding to the video parameters to be learned and the frame parameters to be learned, respectively, comprising:
carrying out random initialization processing on the frame parameters to be learned by using a text encoder of the target visual language pre-training model, and taking a random initialization result of the frame parameters to be learned as frame parameter characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video parameters to be learned based on the current attention mask, so as to obtain video parameter characteristics.
16. The video language model training method of claim 14, wherein the text description tags comprise video description text tags and video frame description text tags, the text encoder utilizing the target visual language pre-training model extracting text semantic features of the text description tags of the video samples comprising:
extracting video text features of the video description text labels by using a text encoder of the target visual language pre-training model;
Performing tokenization processing on the video frame description text labels by using a text encoder of the target visual language pre-training model, and performing word embedding processing on the tokenization processing result to obtain video frame text characteristics;
And utilizing a text encoder of the target visual language pre-training model to encode the video description text label based on the current attention mask so as to obtain video text characteristics.
17. The method for training a video language model according to claim 14, wherein the extracting the image features of each frame of sample image by the image encoder of the target visual language pre-training model to obtain the visual features comprises:
dividing the current frame image into a plurality of image blocks with non-overlapping contents;
converting each image block into a one-dimensional representation through linear mapping, and adding position coding information to the corresponding image block;
And inputting the image block subjected to linear mapping and position coding to an encoder of a second converter model, and extracting the characteristics of the output of the encoder of the second converter model to obtain the visual characteristics of the video sample.
18. The training method of a video language model according to any one of claims 1 to 17, wherein the parameter feature corresponding to the frame parameter to be learned is a frame parameter feature, the text description tag includes a video frame description text tag and a video description text tag, the text semantic feature corresponding to the video frame description text tag is a video frame text feature, the text semantic feature corresponding to the video description text tag is a video text feature, and the training process of the video language model includes:
taking the video frame description text label, the frame parameters to be learned and the video sample data set as inputs, and training the video frame adapter by freezing an image encoder of the target visual language pre-training model and utilizing the frame parameters to be learned to acquire visual information corresponding to the video frame description text label;
When the training of the video frame adapter is completed, taking the video frame description text label, the frame parameters to be learned, the video parameters to be learned and the video sample data set as inputs to train the video adapter.
19. The video language model training method of claim 18, wherein said training said video frame adapter comprises:
invoking a video frame adapter loss function, and training the video frame adapter; the video frame adapter loss function is:
$$\mathrm{Loss}_{frame} = \alpha_0\,\mathrm{Loss}_{ITM} + \alpha_1\,\mathrm{Loss}_{ITC} + \alpha_2\,\mathrm{Loss}_{ITG} + \beta\,\mathrm{Loss}_{MEF}$$

Where Loss frame represents the video frame adapter loss function, Loss ITM is the frame-text matching loss, Loss ITC is the text generation loss, Loss ITG is the frame-text contrast loss, Loss MEF is the video frame mask loss, α 0 is the frame-text matching loss coefficient, α 1 is the text generation loss coefficient, α 2 is the frame-text contrast loss coefficient, and β is the video frame mask loss coefficient.
20. The video language model training method of claim 19, wherein said training said video frame adapter comprises:
invoking a video language loss function to train the video adapter; the video language loss function is:
$$\mathrm{Loss} = \alpha\,\mathrm{Loss}_{frame} + \gamma\,\mathrm{Loss}_{G}$$

Where Loss represents the video language loss function, Loss frame is the video frame adapter loss function, α is the video frame adapter loss function coefficient, Loss G is the video-text loss, and γ is the video-text loss coefficient.
21. A method for executing a video language task, comprising:
Training to obtain a video language model by using the video language model training method as claimed in any one of claims 1 to 20;
Acquiring a video language task to be executed and a corresponding video language task training sample set;
Based on the video language task, utilizing the video language task training sample set to finely tune the video language model;
And executing the video language task by utilizing the trimmed video language model.
22. The video language task execution method according to claim 21, wherein the video language task to be executed is a video content understanding task, and the video language task training sample set is a video sample set of a plurality of video samples carrying video content tags; the fine tuning of the video language model based on the video language task using the video language task training sample set includes:
And based on the video content understanding task, utilizing the video sample set to finely tune the video language model so as to execute the video content understanding task by utilizing the finely tuned video language model.
23. A video language model training apparatus, comprising:
the data acquisition module is used for acquiring a video sample data set carrying a text description tag, preset video parameters to be learned and preset frame parameters to be learned;
The input data processing module is used for inputting the video sample in the video sample data set, the video parameter to be learned and the frame parameter to be learned into the video language model; the video language model comprises a target visual language pre-training model, a video frame adapter and a video adapter; the target visual language pre-training model is used for extracting visual characteristics and parameter characteristics and inputting the visual characteristics and the parameter characteristics into the video frame adapter and the video adapter correspondingly, the video frame adapter is used for converting the visual characteristics into frame visual information meeting the requirements of the target visual language pre-training model, and the video adapter is used for extracting video visual information; the method comprises the steps that parameter characteristics corresponding to video parameters to be learned are input to a video adapter, and parameter characteristics corresponding to frame parameters to be learned are input to the video frame adapter so as to obtain text-related visual information by utilizing the frame parameters to be learned; the image encoder of the target visual language pre-training model is used for extracting visual characteristics of a video sample, the text encoder of the target visual language pre-training model is used for extracting text semantic characteristics of a text description tag of the video sample, and extracting video parameters to be learned and parameter characteristics corresponding to the frame parameters to be learned; the model parameter updating module is used for carrying out iterative updating on the video language model according to the frame visual information, the video visual information and the loss information of text semantic features until a preset model training ending condition is met;
The frame parameter to be learned is characterized in that the parameter characteristic corresponding to the frame parameter to be learned is a frame parameter characteristic, the text description tag comprises a video frame description text tag, the text semantic characteristic corresponding to the video frame description text tag is a video frame text characteristic, and the video frame adapter comprises a frame input layer, a text coding layer, a cross-mode fusion layer, a characteristic enhancement layer and a frame output layer; the frame input layer is used for receiving the splicing result of the frame parameter characteristics and the video frame text characteristics; the text coding layer is used for coding the splicing result based on the current attention mask to obtain frame parameter coding characteristics; the cross-modal fusion layer is used for carrying out cross-modal fusion processing on the frame parameter coding features and the visual features; the feature enhancement layer is used for carrying out feature enhancement processing on the fusion result and inputting enhanced features into the text coding layer; the frame output layer is used for outputting frame visual information;
The video adapter comprises a video input layer, a parameter encoder layer, a feature fusion layer, a feature extraction layer and a video output layer, wherein the parameter features corresponding to the video parameters to be learned are video parameter features; the video input layer is used for receiving the combined characteristics of the visual characteristics and the frame visual information; the parameter encoder layer is used for encoding the video parameter characteristics to obtain video parameter encoding characteristics; the feature fusion layer is used for carrying out fusion processing on the video parameter coding features and the joint features; the feature extraction layer is used for extracting features of the fusion processing result and transmitting the extracted features to the parameter encoder layer; the video output layer is used for outputting video visual information;
Wherein the model parameter updating module is further configured to: each video sample is provided with a text description tag, the video description text tag is text data corresponding to the description of the whole video, the video frame description text tag is text data corresponding to the description of the current frame of image, when frame visual information and video visual information corresponding to the video sample are obtained, the frame visual information and the video visual information are compared with text semantic features of the text description tag corresponding to the video sample, and model parameters of the video language model are updated by continuously reducing the difference between the frame visual information and the video visual information.
24. A video language task execution device, comprising:
a model training module, configured to train to obtain a video language model by using the video language model training method according to any one of claims 1 to 20;
the data acquisition module is used for acquiring a video language task to be executed and a corresponding video language task sample set;
The model fine tuning module is used for carrying out fine tuning on the video language model by utilizing the video language task sample set based on the video language task;
and the task execution module is used for executing the video language task by utilizing the trimmed video language model.
25. An electronic device comprising a processor and a memory, the processor being configured to implement the steps of the video language model training method of any one of claims 1 to 20 and/or the video language task execution method of claim 21 or 22 when executing a computer program stored in the memory.
26. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the video language model training method according to any one of claims 1 to 20 and/or the video language task execution method according to claim 21 or 22.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410270242.6A CN117876940B (en) | 2024-03-11 | 2024-03-11 | Video language task execution and model training method, device, equipment and medium thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410270242.6A CN117876940B (en) | 2024-03-11 | 2024-03-11 | Video language task execution and model training method, device, equipment and medium thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117876940A CN117876940A (en) | 2024-04-12 |
CN117876940B true CN117876940B (en) | 2024-05-31 |
Family
ID=90595230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410270242.6A Active CN117876940B (en) | 2024-03-11 | 2024-03-11 | Video language task execution and model training method, device, equipment and medium thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117876940B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118379749B (en) * | 2024-06-20 | 2024-08-27 | 清华大学 | Visual language model parameter alignment method and device, storage medium and electronic equipment |
CN118520933B (en) * | 2024-07-25 | 2024-09-17 | 山东海量信息技术研究院 | Visual language model training method, device, medium and computer program product |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112668671B (en) * | 2021-03-15 | 2021-12-24 | 北京百度网讯科技有限公司 | Method and device for acquiring pre-training model |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114996513A (en) * | 2022-05-11 | 2022-09-02 | 湖南大学 | Video question-answering method and system based on cross-modal prompt learning |
CN115563342A (en) * | 2022-10-19 | 2023-01-03 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system, equipment and storage medium for video theme retrieval |
CN116363560A (en) * | 2023-03-23 | 2023-06-30 | 上海人工智能创新中心 | Video mask self-coding method and system |
CN116861995A (en) * | 2023-07-10 | 2023-10-10 | 京东科技信息技术有限公司 | Training of multi-mode pre-training model and multi-mode data processing method and device |
CN117037176A (en) * | 2023-08-03 | 2023-11-10 | 厦门大学 | Pre-training language model adaptation method for vision-language task |
CN117253112A (en) * | 2023-08-29 | 2023-12-19 | 哈尔滨工业大学 | Large-model visual language cross-modal learning method for structural health diagnosis |
CN117541956A (en) * | 2023-10-26 | 2024-02-09 | 浙江大学 | Video semantic feature extraction method based on self-supervised learning |
CN117609550A (en) * | 2024-01-17 | 2024-02-27 | 腾讯科技(深圳)有限公司 | Video title generation method and training method of video title generation model |
Non-Patent Citations (2)
Title |
---|
Relation-Mining-Driven Automatic Generation of Video Descriptions; Huang Yi; Bao Bingkun; Xu Changsheng; Journal of Nanjing University of Information Science & Technology (Natural Science Edition); 2017-11-28 (Issue 06); full text *
Research on Text Word Vectors and Pre-trained Language Models; Xu Feifei; Feng Dongsheng; Journal of Shanghai University of Electric Power; 2020-08-15 (Issue 04); full text *
Also Published As
Publication number | Publication date |
---|---|
CN117876940A (en) | 2024-04-12 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN112487182B (en) | Training method of text processing model, text processing method and device | |
WO2022057776A1 (en) | Model compression method and apparatus | |
CN117876940B (en) | Video language task execution and model training method, device, equipment and medium thereof | |
CN111741330B (en) | Video content evaluation method and device, storage medium and computer equipment | |
CN113704460B (en) | Text classification method and device, electronic equipment and storage medium | |
CN114676234A (en) | Model training method and related equipment | |
CN110795944A (en) | Recommended content processing method and device, and emotion attribute determining method and device | |
CN113627447A (en) | Label identification method, label identification device, computer equipment, storage medium and program product | |
CN115221846A (en) | Data processing method and related equipment | |
CN112749556B (en) | Multi-language model training method and device, storage medium and electronic equipment | |
CN117540221B (en) | Image processing method and device, storage medium and electronic equipment | |
CN116432019A (en) | Data processing method and related equipment | |
CN116541492A (en) | Data processing method and related equipment | |
Anitha Kumari et al. | Automated image captioning for flickr8k dataset | |
CN117313728A (en) | Entity recognition method, model training method, device, equipment and storage medium | |
CN117765450B (en) | Video language understanding method, device, equipment and readable storage medium | |
CN115269781A (en) | Modal association degree prediction method, device, equipment, storage medium and program product | |
CN117711001B (en) | Image processing method, device, equipment and medium | |
CN112100389A (en) | Long text classification method and device | |
CN117216255A (en) | Classification model training method and related equipment | |
CN117034133A (en) | Data processing method, device, equipment and medium | |
CN118228035B (en) | Content tag determination method and related equipment | |
CN118135466B (en) | Data processing method, device, computer, storage medium and program product | |
CN118227910B (en) | Media resource aggregation method, device, equipment and storage medium | |
CN116882398B (en) | Implicit chapter relation recognition method and system based on phrase interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||