CN114494980B - Diverse video comment generation method, system, device and storage medium - Google Patents
Diverse video comment generation method, system, device and storage medium
- Publication number
CN114494980B (application CN202210352708.8A)
- Authority
- CN
- China
- Prior art keywords
- comment
- vocabulary
- text
- emotional
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a diverse video comment generation method, system, device and storage medium. To address the one-sided, homogeneous comments produced by current video comment generation models, emotion category weights are introduced as annotations from the standpoint of emotional diversity and, borrowing the idea of the variational autoencoder, a controllable emotional latent vector is modeled to guide the generation of diverse video comments with controllable emotion. The invention achieves high-quality real-time video comment generation and enhances the user's communication experience.
Description
Technical Field
The present invention relates to the technical field of video comment generation, and in particular to a diverse video comment generation method, system, device and storage medium.
Background Art
With the development of the times, video bullet-screen (danmaku) systems have successively arrived on popular video platforms such as BiliBili, iQiyi and Youku. The wide adoption of bullet-screen systems creates a two-way communication mode between users and videos and enhances users' sense of real-time participation while watching. Real-time bullet-screen comments can offer richer viewpoints, attract users' attention and discussion, and enrich the communication experience. Realizing high-quality real-time video comment (bullet-screen) generation therefore has great application value.
Current real-time video comment generation methods mostly adopt traditional end-to-end models that combine a video clip with its neighboring bullet-screen comments to generate a real-time comment. Following the logic by which comments arise, however, comments on the same video clip are shaped by each commenter's viewpoint, emotional inclination and way of thinking, and therefore exhibit diverse characteristics. Most current methods optimize only for comment quality, ignore this diversity, and generate a single comment. For the same video clip and neighboring-comment input, the reference comments serving as ground truth (annotation information) often span multiple types; generating a single comment is thus detrimental to performance evaluation and model optimization, and inconsistent with the logical characteristics of comments.
Summary of the Invention
The object of the present invention is to provide a diverse video comment generation method, system, device and storage medium that realize the generation of diverse video comments with controllable emotional tendencies.
The object of the present invention is achieved through the following technical solutions:
A diverse video comment generation method, comprising:
constructing a video frame image set from the video frame image at the current moment and several of its nearest-neighbor video frame images, extracting the comment in the video frame image at the current moment as a reference comment, and extracting the comments in all the nearest-neighbor video frame images to form a comment text;
extracting visual features from the video frame image set, extracting text features from the comment text in combination with the visual features, and, in combination with the emotion category weight corresponding to the reference comment, generating an emotional latent vector and encoding it as an emotional latent-vector coding feature;
interacting the input word in turn with the words generated at previous time steps, the visual features, the text features and the emotional latent-vector coding feature to obtain the word probability distribution at the current time step, determining the word generated at the current time step according to that distribution, and combining the words generated at all time steps into the video comment at the current moment; wherein the input word is a word from the reference comment or a word generated at a previous time step.
A diverse video comment generation system, comprising:
an information acquisition unit for constructing a video frame image set from the video frame image at the current moment and several of its nearest-neighbor video frame images, extracting the comment in the video frame image at the current moment as a reference comment, and extracting the comments in all the nearest-neighbor video frame images to form a comment text;
a visual encoder for extracting visual features from the video frame image set;
a text encoder for extracting text features from the comment text in combination with the visual features;
a latent vector encoder for generating an emotional latent vector in combination with the emotion category weight corresponding to the reference comment, and encoding it as an emotional latent-vector coding feature;
a comment decoder for interacting the input word in turn with the words generated at previous time steps, the visual features, the text features and the emotional latent-vector coding feature to obtain the word probability distribution at the current time step, wherein the input word is a word from the reference comment or a word generated at a previous time step;
a video comment generation unit for determining the word generated at the current time step according to the word probability distribution at the current time step, and combining the words generated at all time steps into the video comment at the current moment.
A processing device, comprising: one or more processors; and a memory for storing one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the aforementioned method.
A readable storage medium storing a computer program, characterized in that the aforementioned method is implemented when the computer program is executed by a processor.
It can be seen from the above technical solutions provided by the present invention that, to address the one-sided, homogeneous comments produced by current video comment generation models, emotion category weights are introduced as emotional annotations from the standpoint of emotional diversity and, borrowing the idea of the variational autoencoder, a controllable emotional latent vector is modeled to guide the generation of diverse video comments with controllable emotion, which achieves high-quality real-time video comment generation and enhances the user's communication experience.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a diverse video comment generation method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the overall structure of a diverse video comment generation model provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a diverse video comment generation system provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of a processing device provided by an embodiment of the present invention.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
Terms that may be used herein are first explained as follows:
The terms "include", "comprise", "contain", "have" or descriptions of similar meaning should be construed as non-exclusive inclusions. For example, the inclusion of a technical feature element (such as a raw material, component, ingredient, carrier, dosage form, material, dimension, part, mechanism, device, step, process, method, reaction condition, processing condition, parameter, algorithm, signal, datum, product or article) should be construed to include not only the explicitly listed element but also other technical feature elements known in the art that are not explicitly listed.
The diverse video comment generation method, system, device and storage medium provided by the present invention are described in detail below. Contents not described in detail in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not indicated in the embodiments, they are carried out under conventional conditions in the art or the conditions suggested by the manufacturer.
Embodiment 1
As shown in Fig. 1, a diverse video comment generation method mainly includes the following steps:
Step 1. Construct a video frame image set from the video frame image at the current moment and several of its nearest-neighbor video frame images; extract the comment in the video frame image at the current moment as a reference comment; and extract the comments in all the nearest-neighbor video frame images to form a comment text.
Step 2. Extract visual features from the video frame image set; extract text features from the comment text in combination with the visual features; and, in combination with the emotion category weight corresponding to the reference comment, generate an emotional latent vector and encode it as an emotional latent-vector coding feature.
Step 3. Interact the input word in turn with the words generated at previous time steps, the visual features, the text features and the emotional latent-vector coding feature to obtain the word probability distribution at the current time step; determine the word generated at the current time step according to that distribution; and combine the words generated at all time steps into the video comment at the current moment, wherein the input word is a word from the reference comment or a word generated at a previous time step.
In the above scheme of the embodiment of the present invention, extracting visual features from the video frame image set is implemented by a visual encoder; extracting text features from the comment text in combination with the visual features is implemented by a text encoder; obtaining the corresponding emotion category weight from the reference comment, generating the emotional latent vector and encoding it as the emotional latent-vector coding feature is implemented by a latent vector encoder; and interacting the input word in turn with the previously generated words, the visual features, the text features and the emotional latent-vector coding feature to obtain the word probability distribution at the current time step is implemented by a comment decoder. The visual encoder, text encoder, latent vector encoder and comment decoder together constitute the diverse video comment generation model; Fig. 2 shows its overall structure. The working principle of each part and the loss function used in training are described in detail below with reference to Fig. 2.
1. Visual Encoder
As shown in Fig. 2, the visual encoder (Video Encoder) mainly comprises a convolutional neural network (CNN) and a first Transformer model; the first Transformer model mainly comprises a multi-head attention (Multi-head Attention) sublayer and a position-wise feed-forward network (Position-wise Feed-Forward Network).
In the embodiment of the present invention, visual features are extracted from the video frame image set by the visual encoder. Denote the video frame image set as F = {F_1, F_2, ..., F_J}, where F_j is the j-th video frame image, each frame corresponds to one moment, j = 1, 2, ..., J, and J is the number of video frame images. The set F comprises the frame of the specified video at the specified moment together with the frames at its nearest-neighbor moments: F_1 is the video frame image at the current moment, and F_2, ..., F_J are the J-1 frames at the moments nearest to it.
First, the feature of each video frame image is extracted by the convolutional neural network:
V_j = CNN(F_j)
where V_j is the extracted feature of the j-th video frame image F_j. Exemplarily, the convolutional neural network may be a ResNet.
Denote the feature set corresponding to the video frame image set as V = {V_1, V_2, ..., V_J}. The feature set V of the video frame image set F is encoded by the first Transformer model:
W_j = FNN_F(MultiHead-Atten_F(V_j, V, V))
where MultiHead-Atten_F and FNN_F denote, respectively, the multi-head attention module and the fully connected feed-forward network of the first Transformer model, and W_j is the visual feature of the j-th video frame image obtained by the encoding.
Finally, the visual features of the video frame image set F are denoted W_F = {W_1, W_2, ..., W_J}.
Those skilled in the art will understand that (V_j, V, V) is the input of the multi-head attention module and corresponds in turn to its query matrix, key matrix and value matrix; that is, the query takes V_j while the key and value take V. The multi-head attention modules in the subsequent formulas carry analogous meanings; since these expressions are standard in the field, they are not elaborated further.
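For illustration only, the following is a minimal PyTorch sketch of such a visual encoder; the ResNet-18 backbone, the 512-dimensional features, the single encoder layer and the head count are assumptions of the example, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class VisualEncoder(nn.Module):
    """Sketch of the visual encoder: CNN frame features + one Transformer-style layer."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the classification head; keep the pooled 512-d feature V_j per frame.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, frames):                 # frames: (J, 3, H, W)
        V = self.cnn(frames).flatten(1)        # V = {V_1..V_J}, shape (J, 512)
        V = V.unsqueeze(0)                     # treat the J frames as one sequence
        A, _ = self.attn(V, V, V)              # MultiHead-Atten_F(V_j, V, V)
        return self.ffn(A)                     # W_F = {W_1..W_J}, shape (1, J, 512)
```

With frames resized to a common resolution, the visual features would then be obtained as W_F = VisualEncoder()(frames).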
2. Text Encoder
In the embodiment of the present invention, the text encoder (Text Encoder) comprises a first linear encoding layer and a second Transformer model; the second Transformer model comprises two multi-head attention modules and one fully connected feed-forward network. In Fig. 2, the linear encoding layers contained in the text encoder and in the other parts are all denoted Embedding; likewise, all multi-head attention modules are denoted Multi-head Attention and all fully connected feed-forward networks are denoted FeedForward.
In the embodiment of the present invention, text features are extracted from the comment text (context) by the text encoder in combination with the visual features.
First, the comment text is linearly encoded by the first linear encoding layer to obtain the corresponding set of word embedding vectors e = {e_1, e_2, ..., e_M}, where e_m is the word embedding vector of the m-th word in the comment text, m = 1, 2, ..., M, and M is the total number of words in the comment text. In the embodiment of the present invention, the comment text may be bullet-screen text, generally the bullet-screen comments near the specified moment of the specified video; the neighborhood range can be set by those skilled in the art according to actual needs or experience, a larger range giving more words. Referring to the description of the visual encoder, with the specified moment being the current moment, F_1 is the video frame at the current moment, and the comment text contains the comments in the video frame images F_2, ..., F_J.
Then, the set of word embedding vectors e is processed by the first multi-head attention module MultiHead-Atten_e1 of the second Transformer model, and the result of the first multi-head attention module is interacted with the visual features through the second multi-head attention module MultiHead-Atten_e2 and the fully connected feed-forward network FNN_e to obtain the text features:
e_m' = MultiHead-Atten_e1(e_m, e, e)
E_m = FNN_e(MultiHead-Atten_e2(e_m', W_F, W_F))
where e_m' is the result of the first multi-head attention module applied to the word embedding vector e_m of the m-th word, W_F denotes the visual features, and E_m is the text feature corresponding to the m-th word.
Finally, the text features of the comment text are denoted W_e = {E_1, E_2, ..., E_M}.
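A corresponding sketch of the text encoder's two attention stages, under the same illustrative assumptions (the vocabulary size and dimensions are placeholders): the self-attention stands in for MultiHead-Atten_e1, and the cross-attention onto W_F for MultiHead-Atten_e2.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: embed the comment text, self-attend, then attend onto W_F."""
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # first linear encoding layer
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, tokens, W_F):            # tokens: (1, M) word ids
        e = self.embed(tokens)                 # e = {e_1..e_M}
        e2, _ = self.self_attn(e, e, e)        # MultiHead-Atten_e1(e_m, e, e)
        x, _ = self.cross_attn(e2, W_F, W_F)   # interact with the visual features
        return self.ffn(x)                     # W_e = {E_1..E_M}
```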
3. Latent Vector Encoder
In the embodiment of the present invention, the latent vector encoder (Latent Vector Encoder) introduces the encoding principle of the variational autoencoder on top of the Transformer model: a Gaussian mixture distribution is trained, and an emotional latent vector z sampled from it guides the generation of diverse comments. The latent vector encoder generates the emotional latent vector z from the reference comment (comment), the text features W_e and the emotion category weights, and then encodes it as the emotional latent-vector coding feature. The main principle is as follows:
The probability distribution p(z | c, W_e) of the emotional latent vector is modeled as a Gaussian mixture weighted by the emotion category weights c_k:
p(z | c, W_e) = Σ_{k=1}^{K} c_k · N(z; μ_k, σ_k² I)
where c_k is the k-th emotion category weight, K is the number of emotion category weights, c = {c_k}_K is the set of emotion category weights, N(z; μ_k, σ_k² I) is the k-th Gaussian component, μ_k and σ_k² are the mean and variance of the Gaussian model defined in the modeling, I is the standard identity matrix, W_e denotes the text features, and z is the emotional latent vector.
As shown in Fig. 2, the latent vector encoder comprises two linear encoding layers, a third Transformer model, a multilayer perceptron (MLP) and a sampling layer (sample).
The two linear encoding layers (Embedding), called the second and third linear encoding layers, sit at the two ends of the latent vector encoder. The reference comment is linearly encoded by the second linear encoding layer to obtain the corresponding set of word embedding vectors d = {d_1, d_2, ..., d_L}, where d_l is the word embedding vector of the l-th word in the reference comment, l = 1, 2, ..., L, and L is the total number of words in the reference comment.
Referring again to Fig. 2, the third Transformer model comprises two multi-head attention modules and one fully connected feed-forward network. The set of word embedding vectors d is processed by the first multi-head attention module MultiHead-Atten_z1, and the result of the first multi-head attention module is interacted with the text features through the second multi-head attention module MultiHead-Atten_z2 and the fully connected feed-forward network FNN_z to obtain the set of intermediate hidden vectors h:
d_l' = MultiHead-Atten_z1(d_l, d, d)
h_l = FNN_z(MultiHead-Atten_z2(d_l', W_e, W_e))
where d_l' is the result of the l-th layer of the first multi-head attention module applied to the word embedding vector d_l of the l-th word in the reference comment; for l = 2, ..., L, the input of the l-th layer of the first multi-head attention module also includes the intermediate hidden vector h_{l-1} corresponding to the (l-1)-th word of the reference comment; h_l is the intermediate hidden vector obtained by passing the l-th output of the second multi-head attention module through the fully connected feed-forward network FNN_z, and corresponds to the l-th word of the reference comment. The processing finally yields the set of intermediate hidden vectors h = {h_1, h_2, ..., h_L}, with h_L called the last hidden vector.
Those skilled in the art will understand that, for l = 2, ..., L, the first multi-head attention module MultiHead-Atten_z1 receives two kinds of input when computing d_l': one is d_l together with d, the other is h_{l-1}. That is, MultiHead-Atten_z1 processes the set of word embedding vectors d in combination with the output of the second multi-head attention module MultiHead-Atten_z2 and the fully connected feed-forward network FNN_z; since this mechanism is already covered by the multi-head attention module, it is not shown explicitly in the formulas.
The last hidden vector h_L is encoded into the mean and variance of the Gaussian model by the multilayer perceptron:
μ_k, σ_k² = MLP(h_L)
where MLP denotes the multilayer perceptron.
Combining the emotion category weight c_k obtained from the reference comment with the mean μ_k and variance σ_k² encoded by the multilayer perceptron, and substituting them into p(z | c, W_e), the probability distribution of the emotional latent vector z is obtained. The emotional latent vector z is then drawn by the sampling layer and encoded by the third linear encoding layer into the emotional latent-vector coding feature W_z.
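The sampling step can be sketched as follows, again in PyTorch. Giving each emotion category its own MLP head (so that category k has its own μ_k, σ_k²) and using the reparameterization trick are modeling assumptions of this sketch; with a one-hot weight vector c, drawing from the weighted mixture reduces to drawing from the active component.

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Sketch: encode h_L to (mu_k, log sigma_k^2) per category and sample z."""
    def __init__(self, d_model=512, K=3, d_z=64):
        super().__init__()
        # One MLP head per emotion category k (an assumption of this sketch).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                          nn.Linear(d_model, 2 * d_z)) for _ in range(K))
        self.proj = nn.Linear(d_z, d_model)    # third linear encoding layer

    def forward(self, h_L, c):                 # h_L: (d_model,), c: (K,) one-hot
        k = int(torch.argmax(c))               # the active category, c_k = 1
        mu, log_var = self.heads[k](h_L).chunk(2, dim=-1)
        # Reparameterized draw z ~ N(mu, sigma^2 I).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        W_z = self.proj(z)                     # latent-vector coding feature W_z
        return z, W_z, mu, log_var
```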
The above corresponds to the inference phase, in which the model, by driving the encoded mean and variance as close as possible to the corresponding target mean and variance, can effectively model a direct mapping between the selected latent-vector space and the generated comments; in the test phase, diverse comment generation is achieved by selecting different emotion category weights c_k.
In the embodiment of the present invention, the off-the-shelf library SnowNLP may be used to perform emotion analysis (Emotion Analysis) on the reference comment s. SnowNLP(s) outputs an evaluation value in the interval [0, 1] that scores the sentiment of the reference comment s: the larger the value, the more positive the sentence's emotion. The emotional tendency may be divided into three directions, positive, objective and negative (K = 3), from which the weight c_k of the corresponding emotion category is obtained:
c_1 = 1, if T2 < SnowNLP(s) ≤ T1;
c_2 = 1, if T3 < SnowNLP(s) ≤ T2;
c_3 = 1, otherwise.
The emotion category weights are c = {c_1, c_2, c_3}. The conditions above show, for K = 3, when each emotion category weight equals 1. T1, T2 and T3 are set thresholds satisfying T1 > T2 > T3; exemplarily, T1 = 1, T2 = 0.7 and T3 = 0.3 may be used.
In the embodiment of the present invention, the subscript k of the emotion category weight c_k indicates the emotion category. From the current reference comment, the corresponding emotion category k can be determined; its weight is c_k = 1 and the remaining category weights are all 0. Therefore, when substituting into p(z | c, W_e), only the mean and variance of the Gaussian model corresponding to emotion category k are needed.
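A short sketch of this weight assignment using the actual SnowNLP API (SnowNLP(s).sentiments returns a positivity score in [0, 1]) and the exemplary thresholds T1 = 1, T2 = 0.7, T3 = 0.3:

```python
from snownlp import SnowNLP

def emotion_weights(comment, t1=1.0, t2=0.7, t3=0.3):
    """Map a reference comment to one-hot weights c = (c_1, c_2, c_3)."""
    score = SnowNLP(comment).sentiments     # positivity score in [0, 1]
    if t2 < score <= t1:
        return (1, 0, 0)                    # positive:  c_1 = 1
    if t3 < score <= t2:
        return (0, 1, 0)                    # objective: c_2 = 1
    return (0, 0, 1)                        # negative:  c_3 = 1
```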
4. Comment Decoder
As shown in Fig. 2, the comment decoder (Comment Decoder) mainly comprises a fourth linear encoding layer, a fourth Transformer model, a linear layer (not shown) and a softmax layer.
The fourth linear encoding layer encodes the input word to obtain the corresponding word embedding vector, denoted y'. In the inference phase, words of the reference comment (comment) are sampled directly as input; in the test phase, the word generated at the previous time step is used as input. Fig. 2 gives an example with words of the reference comment as input.
Referring again to Fig. 2, the fourth Transformer model comprises four multi-head attention modules and one fully connected feed-forward network. The first multi-head attention module MultiHead-Atten_o1 interacts the word embedding vector y' with the words generated at previous time steps; the second module MultiHead-Atten_o2 interacts the result of the first module with the visual features; the third module MultiHead-Atten_o3 interacts the result of the second module with the text features; the fourth module MultiHead-Atten_o4 interacts the result of the third module with the emotional latent-vector coding feature; and the fully connected feed-forward network FNN_o outputs the final decoded feature:
y^(1) = MultiHead-Atten_o1(y', y, y)
y^(2) = MultiHead-Atten_o2(y^(1), W_F, W_F)
y^(3) = MultiHead-Atten_o3(y^(2), W_e, W_e)
s_t = FNN_o(MultiHead-Atten_o4(y^(3), W_z, W_z))
where y, W_F, W_e and W_z denote, in turn, the words generated at previous time steps, the visual features, the text features and the emotional latent-vector coding feature, and s_t is the final decoded feature.
The final decoded feature s_t passes through the linear layer and the softmax layer to give the word probability distribution at the current time step:
p(y_t | y_0, ..., y_{t-1}, W_z, W_F, W_e) = Softmax(W s_t)
where y_0, ..., y_{t-1} are the words generated from the initial time step 0 through the previous time step t-1 (i.e., the previously generated words y), y_t is the word generated at the current time step, and W is the parameter matrix of the linear layer.
In the embodiment of the present invention, the multi-head attention mechanism of every multi-head attention module in the diverse video comment generation model follows conventional techniques; the formulas show the relevant processing, and the principles by which the various features and intermediate hidden vectors are obtained can be found in the conventional art and are not repeated here. In addition, the comment decoder described above uses the words y generated at previous time steps; in the actual computation these words must be converted into the corresponding set of embedding vectors, but since this principle is conventional and the formulas mainly display the required data, the meaning of the expressions in the present notation will be readily understood by those skilled in the art.
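A sketch of one decoding step through the four attention stages, assuming PyTorch; the dimensions are illustrative and the module wiring follows the formulas above.

```python
import torch
import torch.nn as nn

class CommentDecoder(nn.Module):
    """Sketch: embed the input word, then attend over y, W_F, W_e, W_z in turn."""
    def __init__(self, vocab_size=30000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # fourth linear encoding layer
        self.attn_prev = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_vis = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_z = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.out = nn.Linear(d_model, vocab_size)        # linear layer before softmax

    def forward(self, token, y_prev, W_F, W_e, W_z):
        q = self.embed(token)                            # y': (1, 1, d)
        y1, _ = self.attn_prev(q, y_prev, y_prev)        # previously generated words
        y2, _ = self.attn_vis(y1, W_F, W_F)              # visual features
        y3, _ = self.attn_txt(y2, W_e, W_e)              # text features
        y4, _ = self.attn_z(y3, W_z, W_z)                # latent coding feature
        s_t = self.ffn(y4)                               # final decoded feature s_t
        return torch.softmax(self.out(s_t), dim=-1)      # p(y_t | ...)
```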
5. Loss Function
A traditional encoder-decoder model is trained by maximizing the log-likelihood of the generated comment Y, whereas the present invention defines a generative model controlled by the emotional latent vector z:
p(Y) = ∫ p(Y | z) p(z) dz
where the generated comment Y = {y_0, y_1, ...} is the video comment at the current moment, containing the words generated at all time steps.
Since it is impossible to integrate over all emotional latent vectors z, a variational lower bound (ELBO) on the log-likelihood of the video comment Y at the current moment can be obtained by following the mathematical derivation of the variational autoencoder:
log p(Y) ≥ E_z[log p(Y | z)] − D_KL[q(z | Y) ‖ p(z)]
where p(Y) is the probability distribution of generating the video comment Y at the current moment; p(Y | z) is the probability distribution of generating Y conditioned on the emotional latent vector z; p(z) is the distribution of z; E_z[·] denotes the mathematical expectation with respect to z; D_KL denotes the KL divergence (relative entropy); and q(z | Y) is the distribution obtained by the latent vector encoder, used to approximate the posterior distribution p(z | Y) of the comment decoder. The optimization objective of the present invention is therefore to maximize this variational lower bound of the log-likelihood.
Since the model is further conditioned on the video frame image set F and the set of word embedding vectors e of the comment text, the objective function L becomes:
L = E_{q(z | Y, F, e)}[log p(Y | z, F, e)] − D_KL[q(z | Y, F, e) ‖ p(z | F, e)]
where p(Y | z, F, e) is the probability distribution of generating the video comment Y at the current moment conditioned on the emotional latent vector z, the video frame image set F and the word embedding vectors e of the comment text; E_{q(z | Y, F, e)}[·] denotes the mathematical expectation with respect to q(z | Y, F, e); q(z | Y, F, e) is the distribution of the emotional latent vector z conditioned on the current video comment Y, the video frame image set F and the word embedding vectors e; and p(z | F, e) is the distribution of z conditioned on F and e. The first term, analogous to the log-likelihood of traditional models, encourages the generation of higher-quality comments; it is the reconstruction loss. The second term encourages the learned distribution of the emotional latent vector z to stay as close as possible to the prior p(z | F, e); the prior is set to the standard normal distribution N(0, I), preventing the model from drifting toward assimilation and losing diversity during training.
In the embodiment of the present invention, p and q are both probability distributions; the two symbols serve to distinguish different distributions. As explained above, q(z | Y) is the distribution of the emotional latent vector z obtained by the latent vector encoder, used to approximate the posterior distribution p(z | Y) of the comment decoder; since p(z | Y) cannot be computed directly, a different symbol is used for the approximation.
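A sketch of the resulting objective in PyTorch: a reconstruction term (cross-entropy against the reference comment) plus the closed-form KL divergence between the encoded Gaussian N(μ, σ²I) and the standard-normal prior. The kl_weight factor is an extra knob commonly used to balance the two terms, not part of the formula above.

```python
import torch
import torch.nn.functional as F

def elbo_loss(logits, targets, mu, log_var, kl_weight=1.0):
    """Negative ELBO: reconstruction loss + KL[N(mu, sigma^2 I) || N(0, I)]."""
    rec = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return rec + kl_weight * kl
```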
In view of the one-sided, homogeneous comments produced by current video comment generation models, the above scheme introduces an emotion analysis module for annotation from the standpoint of emotional diversity and, on the basis of the Transformer model, borrows the idea of the variational autoencoder to model a controllable emotional latent vector z that guides the generation of diverse video comments with controllable emotion. Notably, the introduced emotion analysis module (Emotion Analysis in Fig. 2) is independent of the model as a whole; it can therefore be replaced with another sentiment analyzer to achieve finer-grained controllable emotional comment generation, or with a topic analysis module to achieve comment generation with a controllable theme. The solution provided by the present invention thus has strong value for wider application.
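Putting the sketches together, a hypothetical greedy generation loop might look as follows; the module classes are the illustrative ones defined above, and the BOS/EOS ids and the stand-in for h_L at test time (when no reference comment exists) are assumptions of the example.

```python
import torch

@torch.no_grad()
def generate_comment(frames, context_tokens, c, vis, txt, lat, dec,
                     bos_id=1, eos_id=2, max_len=20):
    """Hypothetical glue code over the sketch modules above."""
    W_F = vis(frames)                          # visual features
    W_e = txt(context_tokens, W_F)             # text features
    h_L = W_e.mean(dim=(0, 1))                 # stand-in for h_L at test time
    _, W_z, _, _ = lat(h_L, c)                 # sample z for the chosen emotion c
    W_z = W_z.view(1, 1, -1)
    token = torch.tensor([[bos_id]])
    y_prev = dec.embed(token)
    words = []
    for _ in range(max_len):
        p = dec(token, y_prev, W_F, W_e, W_z)  # p(y_t | y_<t, W_F, W_e, W_z)
        token = p.argmax(dim=-1)               # greedy word choice
        if int(token) == eos_id:
            break
        words.append(int(token))
        y_prev = torch.cat([y_prev, dec.embed(token)], dim=1)
    return words
```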
Embodiment 2
The present invention further provides a diverse video comment generation system, implemented mainly on the basis of the method of Embodiment 1. As shown in Fig. 3, the system mainly includes:
an information acquisition unit for constructing a video frame image set from the video frame image at the current moment and several of its nearest-neighbor video frame images, extracting the comment in the video frame image at the current moment as a reference comment, and extracting the comments in all the nearest-neighbor video frame images to form a comment text;
a visual encoder for extracting visual features from the video frame image set;
a text encoder for extracting text features from the comment text in combination with the visual features;
a latent vector encoder for generating an emotional latent vector in combination with the emotion category weight corresponding to the reference comment, and encoding it as an emotional latent-vector coding feature;
a comment decoder for interacting the input word in turn with the words generated at previous time steps, the visual features, the text features and the emotional latent-vector coding feature to obtain the word probability distribution at the current time step, wherein the input word is a word from the reference comment or a word generated at a previous time step;
a video comment generation unit for determining the word generated at the current time step according to the word probability distribution at the current time step, and combining the words generated at all time steps into the video comment at the current moment.
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the division into the above functional modules is given as an example; in practical applications, the above functions may be assigned to different functional modules as required, i.e., the internal structure of the system may be divided into different functional modules to complete all or part of the functions described above.
It should be noted that the main principles of each part of the system have been described in detail in Embodiment 1 and are not repeated here.
Embodiment 3
The present invention further provides a processing device, as shown in Fig. 4, mainly comprising: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the foregoing embodiments.
Further, the processing device also includes at least one input device and at least one output device; within the processing device, the processor, the memory, the input device and the output device are connected by a bus.
In the embodiment of the present invention, the specific types of the memory, input device and output device are not limited; for example:
the input device may be a touch screen, an image acquisition device, a physical key, a mouse, or the like;
the output device may be a display terminal;
the memory may be a random access memory (RAM) or a non-volatile memory, such as disk memory.
Embodiment 4
The present invention further provides a readable storage medium storing a computer program; when the computer program is executed by a processor, the method provided by the foregoing embodiments is implemented.
The readable storage medium in the embodiment of the present invention, as a computer-readable storage medium, may be provided in the aforementioned processing device, for example as the memory in the processing device. In addition, the readable storage medium may also be any of various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disk.
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210352708.8A CN114494980B (en) | 2022-04-06 | 2022-04-06 | Diverse video comment generation method, system, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210352708.8A CN114494980B (en) | 2022-04-06 | 2022-04-06 | Diverse video comment generation method, system, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114494980A CN114494980A (en) | 2022-05-13 |
CN114494980B true CN114494980B (en) | 2022-07-15 |
Family
ID=81488043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210352708.8A Active CN114494980B (en) | 2022-04-06 | 2022-04-06 | Diverse video comment generation method, system, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494980B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133038A (en) * | 2018-01-10 | 2018-06-08 | 重庆邮电大学 | A kind of entity level emotional semantic classification system and method based on dynamic memory network |
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
CN111079532A (en) * | 2019-11-13 | 2020-04-28 | 杭州电子科技大学 | Video content description method based on text self-encoder |
CN111696535A (en) * | 2020-05-22 | 2020-09-22 | 百度在线网络技术(北京)有限公司 | Information verification method, device, equipment and computer storage medium based on voice interaction |
CN111858914A (en) * | 2020-07-27 | 2020-10-30 | 湖南大学 | A method and system for generating text summaries based on sentence-level evaluation |
CN112329474A (en) * | 2020-11-02 | 2021-02-05 | 山东师范大学 | Attention-fused aspect-level user comment text emotion analysis method and system |
CN113704393A (en) * | 2021-04-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium |
CN113807222A (en) * | 2021-09-07 | 2021-12-17 | 中山大学 | Video question answering method and system for end-to-end training based on sparse sampling |
CN113918764A (en) * | 2020-12-31 | 2022-01-11 | 浙江大学 | A movie recommendation system based on cross-modal fusion |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10303768B2 (en) * | 2015-05-04 | 2019-05-28 | Sri International | Exploiting multi-modal affect and semantics to assess the persuasiveness of a video |
US10049106B2 (en) * | 2017-01-18 | 2018-08-14 | Xerox Corporation | Natural language generation through character-based recurrent neural networks with finite-state prior knowledge |
- 2022-04-06: application CN202210352708.8A filed; granted as patent CN114494980B (status: Active)
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133038A (en) * | 2018-01-10 | 2018-06-08 | 重庆邮电大学 | A kind of entity level emotional semantic classification system and method based on dynamic memory network |
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
CN111079532A (en) * | 2019-11-13 | 2020-04-28 | 杭州电子科技大学 | Video content description method based on text self-encoder |
CN111696535A (en) * | 2020-05-22 | 2020-09-22 | 百度在线网络技术(北京)有限公司 | Information verification method, device, equipment and computer storage medium based on voice interaction |
WO2021232725A1 (en) * | 2020-05-22 | 2021-11-25 | 百度在线网络技术(北京)有限公司 | Voice interaction-based information verification method and apparatus, and device and computer storage medium |
CN111858914A (en) * | 2020-07-27 | 2020-10-30 | 湖南大学 | A method and system for generating text summaries based on sentence-level evaluation |
CN112329474A (en) * | 2020-11-02 | 2021-02-05 | 山东师范大学 | Attention-fused aspect-level user comment text emotion analysis method and system |
CN113918764A (en) * | 2020-12-31 | 2022-01-11 | 浙江大学 | A movie recommendation system based on cross-modal fusion |
CN113704393A (en) * | 2021-04-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium |
CN113807222A (en) * | 2021-09-07 | 2021-12-17 | 中山大学 | Video question answering method and system for end-to-end training based on sparse sampling |
Non-Patent Citations (2)
Title |
---|
Automatic facial expression analysis: a survey; B. Fasel et al.; Pattern Recognition; 2003-12-31; pp. 259-275 *
Research on learning resource recommendation methods based on user interest; Cao Yulong; China Master's Theses Full-text Database (Information Science and Technology); 2020-09-15; pp. I138-155 *
Also Published As
Publication number | Publication date |
---|---|
CN114494980A (en) | 2022-05-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |