CN115661596A - Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer - Google Patents

Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Info

Publication number
CN115661596A
CN115661596A (Application CN202211334609.3A)
Authority
CN
China
Prior art keywords: model, video, convolution, vector, feature vectors
Legal status
Pending
Application number
CN202211334609.3A
Other languages
Chinese (zh)
Inventor
刘绍辉
米亚纯
姜峰
张伟
Current Assignee
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN202211334609.3A priority Critical patent/CN115661596A/en
Publication of CN115661596A publication Critical patent/CN115661596A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/80: Management or planning
    • Y02P 90/82: Energy audits or management systems therefor

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer, relates to the technical field of video violent behavior analysis, and addresses the technical problem of how to better perform positive energy evaluation on videos containing a large number of frames. The method includes: obtaining a video clip, the number of frames of the video clip being a preset number of frames; performing feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors; performing position encoding on the feature vectors; inputting the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector; and inputting the output vector into a multi-layer perceptron model to calculate a positive energy score of the video clip. The method performs positive energy evaluation on short videos on the basis of a 3D convolution model and a Transformer model, gives good temporal modeling performance, and can handle long videos containing a large number of frames. The invention is also applicable to the field of computer vision.

Description

Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Technical Field

The invention relates to the technical field of violent behavior analysis in video.

Background Art

In recent years, user-generated video content has grown explosively on platforms such as Kuaishou, Douyin and Weishi, each of which serves tens of millions or even billions of users. Video capture devices keep improving while becoming cheaper, which greatly reduces the cost of producing short videos; as a result, the huge population of short video users are not only consumers but also creators of short videos. The number of short videos on the Internet has therefore grown rapidly, and short video has quickly become one of the most important sources of information for people.

Because different users create videos for different purposes, there are always some vulgar, negative or unhealthy short videos, and the information spread by such content does not conform to the mainstream values of society.

Existing datasets for violent behavior analysis are relatively scarce, and much of the research has been conducted on a single dataset. Algorithmically, existing techniques usually adopt 2D convolution combined with an RNN (GRU, LSTM), 3D convolution alone, or 3D convolution combined with an RNN (GRU, LSTM). Extensive experiments show that these methods do not perform well enough on video: RNNs have been shown to be much weaker than the Transformer model at temporal modeling, while 3D convolution alone is too limited in the number of frames it can handle to process long videos containing a large number of frames.

Summary of the Invention

To solve the technical problems in the prior art, the present invention provides a short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer. Positive energy evaluation is performed on short videos on the basis of a 3D convolution model and a Transformer model, which gives good temporal modeling performance and makes it possible to handle long videos containing a large number of frames.

A short video positive energy evaluation method based on 3D convolution and Transformer comprises:

obtaining a video clip, the number of frames of the video clip being a preset number of frames;

performing feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;

performing position encoding on the feature vectors;

inputting the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector;

and inputting the output vector into a multi-layer perceptron model to calculate the positive energy score of the video clip.

Further, the 3D convolution model is an R3D model, and the preset number of frames is a multiple of the number of frames that the R3D model can accept per input.

Further, the 3D convolution model includes a plurality of fully connected layers, and the last fully connected layer in the 3D convolution model is denoted as the first fully connected layer;

performing feature extraction on each video clip based on the pre-trained 3D convolution model to obtain a plurality of feature vectors includes:

center-cropping the video clip;

splitting the center-cropped video clip in frame order to obtain a plurality of input frame groups;

inputting the plurality of input frame groups into the 3D convolution model in order, and recording the resulting input vector of the first fully connected layer as an intermediate feature vector;

inputting the plurality of intermediate feature vectors into a second fully connected layer for dimension expansion to obtain the plurality of feature vectors.

Further, the 3D convolution model is trained on the Kinetics dataset with the following parameters:

the 3D convolution model is an 18-layer R3D model, the number of iterations is 1M, and the initial learning rate is 1e-2; the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and thereafter the learning rate is reduced to one tenth of its value every ten stages.

Further, the Transformer model includes at least one Encoder Block, and the Encoder Block includes a multi-head attention structure and a multi-layer perceptron structure;

a normalization operation is performed before the feature vectors are input into each multi-head attention structure, and a residual connection is applied after each multi-head attention structure;

a normalization operation is performed before the feature vectors are input into each multi-layer perceptron structure, and a residual connection is applied after each multi-layer perceptron structure;

the multi-layer perceptron structure includes a third fully connected layer and a fourth fully connected layer; the third fully connected layer is used to expand the dimension of the feature vectors to four times the original size, and the fourth fully connected layer is used to restore the dimension of the feature vectors to the original size.

Further, the Transformer model is trained with the following parameters:

the loss function is MSE, the optimizer is AdamW, the number of Encoder Blocks is 24, and the multi-head attention structure has 16 heads;

the Warm up method is used to warm up the learning rate, with Linear Warm up as the specific strategy; the initial learning rate is set to 1e-5, the number of warm-up steps is set to 15, and the total number of training steps is 60.

Further, inputting the plurality of position-encoded feature vectors into the pre-trained Transformer model to obtain an output vector includes:

inputting the plurality of position-encoded feature vectors together with a classification head vector cls-token into the Transformer model;

feeding the output corresponding to the classification head vector cls-token in the Transformer model as the classification feature into a final fully connected layer, which transforms the 1024-dimensional feature into a one-dimensional vector to obtain the output vector.

Further, the preset number of frames is 96, the number of frames that the R3D model can accept per input is 16, and the number of feature vectors is 6.

A short video positive energy evaluation device based on 3D convolution and Transformer includes:

a video acquisition module, configured to obtain a video clip, the number of frames of the video clip being a preset number of frames;

a feature extraction module, configured to perform feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;

a position encoding module, configured to perform position encoding on the feature vectors;

an output calculation module, configured to input the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector;

a score calculation module, configured to input the output vector into a multi-layer perceptron model to calculate the positive energy score of the video clip.

An electronic device includes a processor and a storage device, the storage device stores a plurality of instructions, and the processor is configured to read the plurality of instructions in the storage device and execute the above method.

The short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer provided by the present invention have at least the following beneficial effects:

(1) Positive energy evaluation of short videos is performed on the basis of a 3D convolution model and a Transformer model: the 3D convolution model extracts features from the video, and the Transformer model then fuses the temporal features of the video, which gives good temporal modeling performance and makes it possible to handle long videos containing a large number of frames;

(2) Training the 3D convolution model and the Transformer model with specific parameters improves the training result while effectively improving training efficiency;

(3) When features are extracted from the video clip with the 3D convolution model, a dimension expansion operation is applied, which increases the model capacity and gives the model better fitting ability;

(4) Position encoding is applied to the feature vectors extracted by the 3D convolution model, which effectively prevents the temporal information between frames from being lost and achieves better temporal feature fusion.

Brief Description of the Drawings

Fig. 1 is a flowchart of an embodiment of the short video positive energy evaluation method based on 3D convolution and Transformer provided by the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of the Encoder Block in the Transformer model provided by the present invention.

Detailed Description of the Embodiments

For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.

It should be noted that the concept of "positive energy" in this embodiment describes the emotional tendency of a short video; specifically, it means that the content of the short video is not related to violent behavior and does not involve vulgar, negative or unhealthy content.

Referring to Fig. 1, in some embodiments, a short video positive energy evaluation method based on 3D convolution and Transformer includes:

S1. Obtain a video clip, the number of frames of the video clip being a preset number of frames;

S2. Perform feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;

S3. Perform position encoding on the feature vectors;

S4. Input the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector;

S5. Input the output vector into a multi-layer perceptron model to calculate the positive energy score of the video clip.

In Fig. 1, R3D is the 3D version of the ResNet model, FC is a fully connected layer, Position Embedding denotes the position encoding, and MLP Head denotes the multi-layer perceptron.

As a preferred implementation, the 3D convolution model is an R3D model, and the preset number of frames is a multiple of the number of frames that the R3D model can accept per input. The R3D model keeps the overall architecture of the original 2D ResNet unchanged, expands the original 3×3 2D convolutions to 3×3×3 3D convolutions, and replaces the pooling layers with 3D pooling. The input dimension of the R3D model is 3×16×112×112. Since the R3D model has relatively few parameters, using it for 3D convolution not only gives better results but also does not slow the model down.
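For illustration, a minimal sketch of using such an R3D backbone as a clip-level feature extractor in PyTorch is given below. It assumes torchvision's 18-layer r3d_18 model as a stand-in for the R3D model described here; the patent does not name a specific implementation, so the model source and the pretrained weights are assumptions.

import torch
from torchvision.models.video import r3d_18  # assumption: torchvision's 18-layer R3D

# Load an 18-layer R3D backbone (Kinetics-pretrained weights are an assumption here)
# and drop its classifier so the forward pass returns the 512-dimensional pooled feature.
backbone = r3d_18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

# One input frame group: batch of 1, 3 channels, 16 frames, 112x112 spatial size.
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    feat = backbone(clip)  # shape (1, 512): one intermediate feature vector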

The method provided in this embodiment is trained on a dataset containing multiple negative energy behaviors, including fighting, gore, shooting, explosions and smoking; compared with an existing single dataset, the resulting model gives better positive energy scoring and more comprehensive recognition.

Specifically, in step S2, performing feature extraction on the video clip based on the pre-trained 3D convolution model to obtain a plurality of feature vectors includes:

S21. Center-crop the video clip;

S22. Split the center-cropped video clip in frame order to obtain a plurality of input frame groups;

S23. Input the plurality of input frame groups into the 3D convolution model in order, and record the resulting input vector of the first fully connected layer as an intermediate feature vector;

S24. Input the plurality of intermediate feature vectors into a second fully connected layer for dimension expansion to obtain a plurality of feature vectors.

In a specific application scenario, the R3D model is used to extract features from the video clip. The R3D model accepts 16 frames per input. When obtaining a video clip, a random sampling strategy is applied to the complete video: each time, a 96-frame clip is randomly sampled from the whole video, 96 being a multiple of the number of frames the R3D model can accept per input. The 96-frame clip is then center-cropped to obtain 96 frames of size 112×112 as the network input. Since the R3D model can only take 16 frames at a time, the clip is split in frame order into 6 input frame groups of 16 frames each; the R3D model then yields 6 intermediate feature vectors of 512 dimensions, and the second fully connected layer raises the feature dimension from 512 to 1024. A sketch of this pipeline is given below.
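As a hedged sketch of this feature-extraction pipeline (the extract_clip_features helper and the backbone object from the previous sketch are illustrative assumptions, not the patent's code), the center crop, the split into 6 groups of 16 frames, and the 512-to-1024 dimension expansion could look like this:

import torch
import torch.nn as nn

def extract_clip_features(frames, backbone, fc_up):
    # frames: (96, 3, H, W) with H, W >= 112; returns (6, 1024) feature vectors.
    _, _, h, w = frames.shape
    top, left = (h - 112) // 2, (w - 112) // 2
    frames = frames[:, :, top:top + 112, left:left + 112]  # center crop to 112x112

    # Split the 96 frames in order into 6 input frame groups of 16 frames each,
    # arranged as (group, channel, time, height, width) for the 3D convolution model.
    groups = frames.reshape(6, 16, 3, 112, 112).permute(0, 2, 1, 3, 4)

    with torch.no_grad():
        feats = backbone(groups)  # (6, 512) intermediate feature vectors
    return fc_up(feats)           # second fully connected layer: 512 -> 1024

fc_up = nn.Linear(512, 1024)      # the "second fully connected layer"
# features = extract_clip_features(torch.randn(96, 3, 128, 171), backbone, fc_up)  # -> (6, 1024)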

The input of the last fully connected layer in the R3D model is taken as the extracted feature, i.e., the intermediate feature vector, and the dimension expansion operation is then applied to obtain the feature vectors that are fed into the Transformer model in the subsequent steps. In step S24, expanding the dimension of the obtained intermediate feature vectors with the second fully connected layer increases the model capacity and gives the model better fitting ability.

As a preferred implementation, the 3D convolution model is trained on the Kinetics dataset with the following parameters: the 3D convolution model is an 18-layer R3D model, the number of iterations is 1M, and the initial learning rate is 1e-2; the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and thereafter the learning rate is reduced to one tenth of its value every ten stages.

Referring to Fig. 2, in some embodiments, the Transformer model used includes at least one Encoder Block, and the Encoder Block includes a multi-head attention structure and a multi-layer perceptron structure. A normalization (Norm) operation is performed before the feature vectors are input into each multi-head attention (Multi-Head Attention) structure, and a residual connection is applied after each multi-head attention structure; a Norm operation is performed before the feature vectors are input into each multi-layer perceptron (MLP) structure, and a residual connection is applied after each multi-layer perceptron structure. The multi-layer perceptron structure includes a third fully connected layer and a fourth fully connected layer; the third fully connected layer expands the dimension of the feature vectors to four times the original size, and the fourth fully connected layer restores it to the original size. In Fig. 2, Norm denotes the normalization layer, MLP denotes the multi-layer perceptron structure, and Multi-Head Attention denotes the multi-head attention structure.

In some embodiments, the Transformer model is obtained by stacking multiple Encoder Blocks. An Encoder Block consists mainly of a Multi-Head Attention (MSA) structure and an MLP structure, together with residual connections and layer normalization. Note that layer normalization processes all features of each sample, whereas batch normalization processes all samples of each channel. Before the data are fed into the MSA and MLP structures, a Layer Normalization operation is applied to normalize them, and a residual connection is applied after each MSA and MLP structure. The MLP structure follows the MSA and contains two fully connected layers: the first expands the feature dimension to four times the original size, and the second restores it to the original size; the activation function is GELU (Gaussian Error Linear Unit) throughout. After the position-encoded feature vectors are fed into the Transformer model, they are propagated forward through multiple Encoder Blocks; the sequence dimension is unchanged after each Encoder Block, and finally the feature vector corresponding to the cls-token is taken as the output.
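A minimal PyTorch sketch of one such pre-norm Encoder Block is given below; the layer sizes follow the description above (1024-dimensional tokens, 16 heads, fourfold MLP expansion, GELU), while the exact module layout of the patent's model is an assumption.

import torch.nn as nn

class EncoderBlock(nn.Module):
    # Pre-norm encoder block: Norm -> Multi-Head Attention -> residual, Norm -> MLP -> residual.
    def __init__(self, dim=1024, num_heads=16, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),  # third fully connected layer: expand to 4x
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),  # fourth fully connected layer: restore size
        )

    def forward(self, x):  # x: (batch, sequence length, dim); shape is unchanged
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual after attention
        x = x + self.mlp(self.norm2(x))                    # residual after MLP
        return x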

In step S3, position encoding is applied to the obtained feature vectors, which effectively prevents the temporal information between frames from being lost.

In a specific application scenario, the 6 feature vectors of 1024 dimensions are position-encoded to preserve the temporal information between video blocks. A learnable position encoding vector is used here; the position encoding is equivalent to a lookup table with N rows, where N equals the length of the input sequence and each row corresponds to one video block in the sequence. The dimension of the sequence is unchanged after the position encoding is added.
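A minimal sketch of this learnable position-encoding table (the variable names and the initialization are illustrative assumptions):

import torch
import torch.nn as nn

seq_len, dim = 6, 1024                                   # 6 video-block features of 1024 dimensions
pos_embed = nn.Parameter(torch.zeros(1, seq_len, dim))   # learnable table with N = 6 rows
nn.init.trunc_normal_(pos_embed, std=0.02)

features = torch.randn(1, seq_len, dim)                  # output of the R3D + FC feature extraction
encoded = features + pos_embed                           # sequence dimension is unchanged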

In step S4, inputting the plurality of position-encoded feature vectors into the pre-trained Transformer model to obtain an output vector includes:

S41. Input the plurality of position-encoded feature vectors together with the classification head vector cls-token into the Transformer model;

S42. Take the vector corresponding to the classification head vector cls-token in the Transformer model as the output vector.

As a preferred implementation, the Transformer model is trained with the following parameters: the loss function is MSE, the optimizer is AdamW, the number of Encoder Blocks is 24, and the multi-head attention structure has 16 heads; the output corresponding to the classification head vector cls-token is fed as a feature into the final fully connected layer, which transforms the 1024-dimensional feature into a one-dimensional score. The AdamW optimizer is a variant of the Adam optimizer in which weight decay is decoupled from the gradient update rather than applied as L2 regularization.
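The sketch below outlines a model and training step matching these parameters (cls-token prepended to the sequence, its output fed to a final fully connected layer that produces a scalar score, MSE loss, AdamW); it reuses the EncoderBlock sketch above, and the data objects are placeholders rather than the patent's actual code.

import torch
import torch.nn as nn

class PositiveEnergyTransformer(nn.Module):
    def __init__(self, dim=1024, depth=24, num_heads=16):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # classification head vector
        self.blocks = nn.ModuleList([EncoderBlock(dim, num_heads) for _ in range(depth)])
        self.head = nn.Linear(dim, 1)                           # 1024-d feature -> one-dimensional score

    def forward(self, x):  # x: (batch, 6, 1024) position-encoded feature vectors
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 0])  # output corresponding to the cls-token

model = PositiveEnergyTransformer()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

# One hypothetical training step on a batch of clip features and target scores:
# loss = criterion(model(batch_features).squeeze(-1), batch_scores)
# optimizer.zero_grad(); loss.backward(); optimizer.step()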

As a preferred implementation, the Warm up method is used to warm up the learning rate, with Linear Warm up as the specific Warm up strategy. The initial learning rate is set to 1e-5, the number of warm-up steps is set to 15, and the total number of training steps is 60; that is, during the first 15 epochs the learning rate increases uniformly from 6.67e-7 to the initial learning rate of 1e-5, and then decreases uniformly to 2.22e-7. The idea of the warm-up operation is to let the learning rate grow slowly from a small value and to use the larger learning rate for training only once the model is changing less. In this way, the convergence of the model is effectively accelerated, which improves the efficiency of model training.
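The schedule above can be reproduced with a simple step-dependent scaling factor; the sketch below is a hypothetical helper rather than the patent's code, and it yields the quoted values: about 6.67e-7 at the first step, 1e-5 at the end of warm-up, and about 2.22e-7 at the last of the 60 steps.

import torch

base_lr, warmup_steps, total_steps = 1e-5, 15, 60

def lr_factor(step):
    # Linear warm-up over the first 15 steps, then linear decay over the remaining 45.
    if step < warmup_steps:
        return (step + 1) / warmup_steps                         # 1/15, 2/15, ..., 15/15
    return (total_steps - step) / (total_steps - warmup_steps)   # 45/45, 44/45, ..., 1/45

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)    # model as in the sketch above
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

# step 0:  1e-5 * 1/15  = about 6.67e-7   (start of warm-up)
# step 14: 1e-5 * 15/15 = 1e-5            (end of warm-up)
# step 59: 1e-5 * 1/45  = about 2.22e-7   (last training step)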

In some embodiments, when the video is long, steps S1-S5 are repeated to extract and analyze multiple clips.

In some embodiments, a short video positive energy evaluation device based on 3D convolution and Transformer is provided, including:

a video acquisition module, configured to obtain a video clip, the number of frames of the video clip being a preset number of frames;

a feature extraction module, configured to perform feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;

a position encoding module, configured to perform position encoding on the feature vectors;

an output calculation module, configured to input the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector;

a score calculation module, configured to input the output vector into a multi-layer perceptron model to calculate the positive energy score of the video clip.

In some embodiments, an electronic device is provided, comprising a processor and a storage device, the storage device storing a plurality of instructions, and the processor being configured to read the plurality of instructions in the storage device and execute the above method.

In a specific application scenario, the above method runs on an NVIDIA GTX 3090 GPU, and the machine learning framework is PyTorch 1.12.

Most existing methods for analyzing violent behavior in video combine 2D convolution with an RNN model (GRU, LSTM), use 3D convolution alone, or combine 3D convolution with an RNN model. When these methods are used for short video positive energy scoring, RNNs have been shown to be much weaker than the Transformer model at temporal modeling, while 3D convolution alone is too limited in the number of frames it can handle to process long videos containing a large number of frames.

When the 3D convolution model is combined with the Transformer model, the relatively large number of parameters of the Transformer model is a problem that needs to be addressed. The short video positive energy evaluation method based on 3D convolution and Transformer provided in this embodiment performs temporal modeling on the video-block features extracted by R3D, so that the 3D convolution model and the Transformer model can be applied to short video positive energy evaluation with good temporal modeling performance while still handling long videos containing a large number of frames. To highlight the advantages of this technique, it was compared with three baselines, namely C3D + GRU, C3D + Transformer and R3D + GRU; the proposed technique clearly outperforms the other three models, which demonstrates the effectiveness of combining the R3D feature extractor with the Transformer model.

Although preferred embodiments of the present invention have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention. Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A short video positive energy evaluation method based on 3D convolution and a Transformer is characterized by comprising the following steps:
acquiring a video clip, wherein the frame number of the video clip is a preset frame number;
performing feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
performing position encoding on the feature vectors;
inputting the plurality of position-encoded feature vectors into a pre-trained Transformer model to obtain an output vector;
and inputting the output vector into a multi-layer perceptron model, and calculating a positive energy score of the video clip.
2. The method of claim 1, wherein the 3D convolution model is an R3D model, and the preset frame number is a multiple of a number of frames that the R3D model can input at a time.
3. The method of claim 2, wherein the 3D convolution model includes a plurality of fully connected layers, and the last fully connected layer in the 3D convolution model is denoted as a first fully connected layer;
performing feature extraction on each video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors comprises:
performing center cropping on the video clip;
splitting the center-cropped video clip in frame order to obtain a plurality of input frame groups;
inputting the plurality of input frame groups into the 3D convolution model in order, and recording the obtained input vector of the first fully connected layer as an intermediate feature vector;
and inputting the plurality of intermediate feature vectors into a second fully connected layer for dimension expansion to obtain a plurality of feature vectors.
4. The method of claim 2, wherein the 3D convolution model is trained based on a dataset Kinetics and based on the following parameters:
the 3D convolution model is an 18-layer R3D model, the number of iterations is 1M, and the initial learning rate is 1e-2; the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and thereafter the learning rate is reduced to one tenth of its value every ten stages.
5. The method of claim 4, wherein the Transformer model comprises at least one Encoder Block, and the Encoder Block comprises a multi-head attention structure and a multi-layer perceptron structure;
a normalization operation is performed before the feature vectors are input into each multi-head attention structure, and a residual connection is applied after each multi-head attention structure;
a normalization operation is performed before the feature vectors are input into each multi-layer perceptron structure, and a residual connection is applied after each multi-layer perceptron structure;
the multi-layer perceptron structure comprises a third fully connected layer and a fourth fully connected layer, the third fully connected layer is used for expanding the dimension of the feature vectors to four times the original size, and the fourth fully connected layer is used for restoring the dimension of the feature vectors to the original size.
6. The method of claim 5, wherein the Transformer model is trained based on the following parameters:
the loss function adopts MSE, the optimizer adopts an AdamW optimizer, the number of layers of the Encoder Block is 24, and the number of heads of a multi-head attention structure is 16;
warming up the learning rate by adopting a Warm up method, selecting Linear Warm up as the specific Warm up strategy, setting the initial learning rate to 1e-5, setting the number of warm-up steps to 15, and setting the total number of training steps to 60.
7. The method of claim 5, wherein inputting the plurality of position-encoded feature vectors into the pre-trained Transformer model to obtain an output vector comprises:
inputting the plurality of position-encoded feature vectors and the classification head vector cls-token into the Transformer model;
and inputting the output corresponding to the classification head vector cls-token in the Transformer model as the classification feature into a final fully connected layer, which converts the 1024-dimensional feature into a one-dimensional vector to obtain the output vector.
8. The method according to claim 2, wherein the preset frame number is 96, the number of frames that the R3D model can input at a time is 16, and the number of feature vectors is 6.
9. A short video positive energy evaluation device based on 3D convolution and a Transformer is characterized by comprising:
the video acquisition module is used for acquiring a video clip, and the frame number of the video clip is a preset frame number;
the feature extraction module is used for extracting features of the video clips based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
the position coding module is used for carrying out position coding on the feature vector;
the output calculation module is used for inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and the score calculating module is used for inputting the output vector to the multilayer perceptron model and calculating to obtain the positive energy score of the video clip.
10. An electronic device comprising a processor and a storage device, wherein a plurality of instructions are stored in the storage device, and wherein the processor is configured to read the plurality of instructions from the storage device and to perform the method according to any one of claims 1 to 8.
CN202211334609.3A 2022-10-28 2022-10-28 Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer Pending CN115661596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211334609.3A CN115661596A (en) 2022-10-28 2022-10-28 Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211334609.3A CN115661596A (en) 2022-10-28 2022-10-28 Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Publications (1)

Publication Number Publication Date
CN115661596A true CN115661596A (en) 2023-01-31

Family

ID=84993814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211334609.3A Pending CN115661596A (en) 2022-10-28 2022-10-28 Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer

Country Status (1)

Country Link
CN (1) CN115661596A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402811A (en) * 2023-06-05 2023-07-07 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment
CN116402811B (en) * 2023-06-05 2023-08-18 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination