CN106934352A - A kind of video presentation method based on two-way fractal net work and LSTM - Google Patents
A kind of video presentation method based on two-way fractal net work and LSTM Download PDFInfo
- Publication number
- CN106934352A CN106934352A CN201710111507.8A CN201710111507A CN106934352A CN 106934352 A CN106934352 A CN 106934352A CN 201710111507 A CN201710111507 A CN 201710111507A CN 106934352 A CN106934352 A CN 106934352A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- fractal
- lstm
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts optical flow features between adjacent frames of the original video; two fractal networks then learn high-level feature expressions of the video frames and of the optical flow features, respectively; these expressions are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are weighted and averaged to obtain the description sentence corresponding to the video. The invention exploits both the original frames and the optical flow of the video to be described: the added optical flow features compensate for the dynamic information inevitably lost by frame sampling, so that changes of the video in both the spatial and temporal dimensions are taken into account. Furthermore, the novel fractal networks turn low-level features into abstract visual feature expressions, so that the people, objects, behaviors, spatial relations, and other connections involved in the video can be analyzed and mined more accurately.
Description
Technical Field
The invention belongs to the technical field of video description and deep learning, and particularly relates to a video description method based on a two-way fractal network and an LSTM.
Background
With the progress of science and technology and the development of society, camera-equipped terminals, especially smartphones, have become ubiquitous, and the price of hardware storage keeps falling, so the volume of multimedia information is growing exponentially. Faced with such massive video streams, how to analyze, recognize, and understand video efficiently and automatically, with minimal human intervention, and then describe it semantically has become a hot topic in current image processing and computer vision research. For most people, describing a short video in language after watching it is a simple matter. For a machine, however, extracting the pixel information of every frame of a video, analyzing and processing it, and generating a natural language description is a challenging task.
Enabling machines to describe video efficiently and automatically has broad application prospects in computer vision fields such as video retrieval, human-computer interaction, and traffic security, and thus further promotes research on the semantic description of video.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a video description method based on a two-way fractal network and an LSTM.
In order to achieve the purpose, the invention adopts the following technical scheme:
A video description method based on a two-way fractal network and LSTM is characterized in that key frames are first sampled from the video to be described and optical flow features between adjacent frames of the original video are extracted; high-level feature expressions of the key frames and of the optical flow features are then learned by two fractal networks, respectively; these expressions are respectively input into two recurrent neural network models based on LSTM units; and finally the output values of the two independent recurrent neural network models at each time step are weighted and averaged, so as to obtain the description sentence corresponding to the video. The method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks respectively;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural networks based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent models at each moment and obtaining the description sentences corresponding to the video.
Preferably, in step S1, the extracting optical flow features of the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing to a pixel range of [0,255 ];
and S1.2, calculating the amplitude value of the optical flow, and combining the optical flow characteristic values obtained in the last step to form an optical flow graph.
Preferably, the specific steps of obtaining the high-level feature expression of the key frame and the optical flow feature in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
Preferably, in steps S2.1 and S2.2, repeated application of a single expansion rule generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but no pass-through connections; meanwhile, in order to be able to extract high-performance fixed-depth sub-networks, a path-dropping method is adopted to regularize the co-adaptation of sub-paths in the fractal architecture; for fractal networks, simplicity of training matches simplicity of design, and a single loss function connected to the last layer suffices to drive internal behavior that mimics deep supervision; the fractal network adopted is a deep convolutional neural network based on a fractal structure.
Preferably, the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal specifically:
The base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types. The base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
Successive fractals are then defined recursively, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
In equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$; the join operation $\oplus$ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor of activations over a fixed number of channels in a spatial region, where the number of channels corresponds to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single connection layer, which merges all of its input feature blocks into a single output block.
Preferably, the rule for regularizing the co-adaptation of sub-paths in the fractal architecture by the path-dropping method in steps S2.1 and S2.2 is specifically as follows: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping inhibits the co-adaptation of parallel paths by randomly discarding operands of a connection layer, which effectively prevents the network from using one path as an anchor and another as a correction, a pattern that may cause over-fitting behavior; two sampling strategies are employed:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
Preferably, the step S3 of inputting the high-level feature vector into two LSTM unit-based recurrent neural network models is specifically:
the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
Preferably, the neural network model structure in step S3 is:
Based on the recurrent neural network structure of two layers of LSTM units, the stacked two-layer LSTM recurrent network encodes and decodes the input feature vectors, thereby realizing the conversion into natural language text. At each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task. Since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step. For a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
Preferably, in step S4, the specific operation of weighted averaging the output values of the two neural network independent models at each time point and obtaining the description sentence corresponding to the video is:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, the occurrence probability of each word in the vocabulary V is calculated with a softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary.
And S4.3, in the decoding stage of each moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The optical flow features added by the invention compensate for the dynamic information inevitably lost by frame sampling, so changes of the video in both the spatial and temporal dimensions are taken into account.
2. By processing any input video, the video description method based on the two-way fractal network and LSTM can automatically generate a descriptive sentence about the video content end to end, and can be applied to fields such as video retrieval, video surveillance, and human-computer interaction.
3. The invention turns low-level features into abstract visual feature expressions through a novel fractal network, so that the people, objects, behaviors, spatial relations, and other connections in the video can be analyzed and mined more accurately.
Drawings
Fig. 1 is a flow framework diagram of a two-way fractal network and LSTM based video description method provided by the present invention;
FIG. 2 is a schematic diagram of a fractal subnetwork used in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM unit based recurrent neural network employed by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The method comprises the steps of sampling key frames of a video to be described, extracting optical flow characteristics between two adjacent frames of an original video, learning and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics through two fractal networks, inputting the high-level characteristic expressions into two recurrent neural network models based on an LSTM unit, and performing weighted average on output values of the two independent recurrent neural network models at each moment to obtain a description sentence corresponding to the video.
FIG. 1 is an overall flow chart of the present invention, comprising the steps of:
(1) sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video; wherein, the specific operation of extracting the optical flow characteristic of the video to be described is as follows:
1. The optical flow values in the x direction and the y direction are calculated for every two adjacent frames of the video and normalized to the pixel range [0, 255];
2. The magnitude of the optical flow is calculated and combined with the optical flow values obtained in the previous step to form an optical flow graph.
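For illustration, a minimal sketch of this optical flow extraction step follows; the use of OpenCV's Farneback dense optical flow and all names here are assumptions for the example, since the embodiment does not prescribe a particular optical flow algorithm.

```python
# Sketch of step (1): x/y optical flow between adjacent frames, normalized to
# [0, 255], plus the flow magnitude, combined into a three-channel flow image.
# The Farneback algorithm and its parameters are illustrative assumptions.
import cv2
import numpy as np

def flow_image(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # flow[..., 0] and flow[..., 1] hold the x- and y-direction flow values
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)

    def to_u8(channel):
        # normalize an arbitrary-range channel into the [0, 255] pixel range
        return cv2.normalize(channel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # combine x flow, y flow, and magnitude into a single optical flow graph
    return np.dstack([to_u8(flow[..., 0]), to_u8(flow[..., 1]), to_u8(mag)])
```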
(2) And respectively learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks. Sequentially inputting the sampling frames of the video obtained in the first step to a fractal network for processing the spatial dimension relation in the order of time points, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network; and sequentially inputting the obtained light flow diagrams to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
The fractal network is mainly characterized in that a self-similarity-based design strategy is introduced on a macroscopic framework of a neural network, an extremely deep network is generated through repeated application of a single expansion rule, and the structural layout of the fractal network is a truncated fractal. The network comprises interacting sub-paths of different lengths but does not comprise any through connections. Meanwhile, in order to realize the capability of extracting high-performance fixed-depth sub-networks, a path rejection method is adopted to regularize the cooperative adaptation of sub-paths in the fractal architecture. For fractal networks, the simplicity of training corresponds to the simplicity of the design, with a single loss function connected to the last layer sufficient to drive internal behavior to mimic deep supervision. The fractal network adopted in the invention is a deep convolutional neural network based on a fractal structure.
As shown in FIG. 2, which is a schematic diagram of the fractal structure, the base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types. The base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
The following fractal structure is then defined by recursion, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
In equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$. The join operation $\oplus$ combines two feature blocks into one, where a feature block is the output of a convolutional layer: a tensor of activations over a fixed number of channels in a spatial region. The number of channels corresponds to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent joins merge into a single connection layer. As shown on the right side of FIG. 2, this connection layer spans multiple columns and merges all of its input feature blocks into a single output block.
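The expansion rule above can be sketched as follows. This is a minimal PyTorch illustration, not the patented implementation; in particular it assumes that the join operation averages its two feature blocks and that the conv unit is a 3x3 convolution with batch normalization and ReLU.

```python
# Sketch of f_1(z) = conv(z) and f_{C+1}(z) = join((f_C o f_C)(z), conv(z)).
import torch
import torch.nn as nn

def conv_unit(channels):
    # the single selected layer type used as the basic building unit (assumed)
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.BatchNorm2d(channels),
                         nn.ReLU(inplace=True))

class Fractal(nn.Module):
    def __init__(self, channels, C):
        super().__init__()
        self.short = conv_unit(channels)                     # the new single-conv column
        self.deep = (nn.Sequential(Fractal(channels, C - 1),
                                   Fractal(channels, C - 1))
                     if C > 1 else None)                     # the (f_{C-1} o f_{C-1}) branch

    def forward(self, z):
        if self.deep is None:                                # base case: f_1(z) = conv(z)
            return self.short(z)
        # join: assumed here to be an element-wise mean of the two feature blocks
        return 0.5 * (self.short(z) + self.deep(z))

# example: a C = 3 block whose longest path has 2 ** (3 - 1) = 4 conv layers
# block = Fractal(channels=64, C=3)
```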
Since fractal networks contain additional large scale structures, it is proposed to use a coarse-grained regularization strategy like dropout and drop-connect. Path dropping inhibits the common adaptation of parallel paths by randomly dropping operands at the link layer, which effectively prevents the network from using one path as an anchor and another path as a correction that may cause overfitting behavior. Here, two sampling strategies are mainly used:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
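A minimal sketch of these two path-dropping (drop-path) strategies follows, applied at a join layer that averages whichever inputs survive; the drop probability and the framework (PyTorch) are assumptions for the example.

```python
import random
import torch

def local_drop_join(inputs, p_drop=0.15):
    """Local sampling: drop each input of the join with a fixed probability,
    but guarantee that at least one input is retained."""
    kept = [x for x in inputs if random.random() > p_drop]
    if not kept:
        kept = [random.choice(inputs)]
    return torch.stack(kept, dim=0).mean(dim=0)

def global_column_join(inputs, column_index):
    """Global sampling: a single column is chosen for the entire network
    (column_index is drawn once per sample), and every join keeps only the
    input arriving from that column."""
    return inputs[column_index]
```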
(3) And respectively inputting the high-level feature vectors obtained in the last step into two recurrent neural networks based on the LSTM unit. The recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
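For reference, a minimal NumPy sketch of one forward step following equations (1-3) to (1-7) is given below. The dictionary-based parameter layout is an assumption for the example, and the final hidden-state update h_t = o_t * tanh(c_t) is the standard LSTM completion rather than one of the equations listed above.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = (1 + e^(-x))^(-1)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step; W_x, W_h, b are dicts keyed by gate name 'i', 'f', 'o', 'g'."""
    i_t = sigmoid(W_x['i'] @ x_t + W_h['i'] @ h_prev + b['i'])   # (1-3) input gate
    f_t = sigmoid(W_x['f'] @ x_t + W_h['f'] @ h_prev + b['f'])   # (1-4) memory gate
    o_t = sigmoid(W_x['o'] @ x_t + W_h['o'] @ h_prev + b['o'])   # (1-5) output gate
    g_t = np.tanh(W_x['g'] @ x_t + W_h['g'] @ h_prev + b['g'])   # (1-6) core gate
    c_t = f_t * c_prev + i_t * g_t                               # (1-7) cell state
    h_t = o_t * np.tanh(c_t)                                     # assumed standard hidden update
    return h_t, c_t
```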
As shown in FIG. 3, the recurrent neural network structure based on two layers of LSTM units is used to encode and decode the input feature vectors, thereby realizing the conversion into natural language text. At each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task. Since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step. For a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
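As an illustration of this objective, the sketch below expresses maximizing the sentence log-likelihood as minimizing per-word cross-entropy with stochastic gradient descent; the PyTorch framework and the `model` interface (a hypothetical two-layer LSTM decoder returning per-step vocabulary logits) are assumptions, not part of the patent.

```python
import torch
import torch.nn.functional as F

def sentence_nll(model, features, word_ids):
    """Negative log-likelihood of a caption: -sum_t log p(y_t | h, y_{t-1}; theta)."""
    logits = model(features, word_ids[:-1])           # predict y_t given h and y_{t-1}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           word_ids[1:].reshape(-1),
                           reduction='sum')

# one training step (sketch):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # stochastic gradient descent
# loss = sentence_nll(model, features, word_ids)
# loss.backward()                                             # errors propagated back through time
# optimizer.step()
```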
(4) Carrying out weighted average on output values of two independent models at each moment and obtaining a description sentence corresponding to a video, wherein the specific operation is as follows:
1. carrying out weighted average on output values of the second layer of LSTM neurons at each moment of the two independent models;
2. The occurrence probability of each word in the vocabulary V is calculated using the softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary.
3. And in the decoding stage at each moment, the word with the maximum probability in the output value of the softmax function is taken, so that the corresponding video description sentence is obtained.
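A minimal sketch of this fusion-and-decoding step follows; the equal fusion weight alpha = 0.5 and all names are illustrative assumptions.

```python
import numpy as np

def decode_step(z_rgb, z_flow, W_vocab, alpha=0.5):
    """One decoding step: weighted average of the two streams' outputs,
    softmax over the vocabulary V, and greedy choice of the most probable word."""
    z_t = alpha * z_rgb + (1.0 - alpha) * z_flow   # weighted average of the two model outputs
    scores = W_vocab @ z_t                         # W_y z_t for every word y in the vocabulary
    probs = np.exp(scores - scores.max())          # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs            # index of the most probable word, and p(y | z_t)
```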
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (9)
1. A video description method based on two-way fractal network and LSTM is characterized in that firstly, sampling key frames of a video to be described is carried out, optical flow characteristics between two adjacent frames of an original video are extracted, then the two fractal networks are used for learning respectively and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics, then the high-level characteristic expressions are respectively input into two recurrent neural network models based on an LSTM unit, and finally, weighted average is carried out on output values of the two independent recurrent neural network models at each moment, so that description sentences corresponding to the video are obtained; the method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of key frames and optical flow features through two fractal networks respectively; wherein the fractal network is generated by repeated application of a single expansion rule;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural network models based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent recurrent neural network models at each moment and obtaining the description sentences corresponding to the video.
2. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S1 of extracting the optical flow features from the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing to a pixel range of [0,255 ];
and S1.2, calculating the amplitude value of the optical flow, and combining the optical flow characteristic values obtained in the last step to form an optical flow graph.
3. The video description method based on two-way fractal network and LSTM as claimed in claim 1, wherein the specific steps of obtaining the high-level feature expression of the key frame and optical flow features in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
4. A video description method based on two-way fractal network and LSTM according to claim 3, characterized in that the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but does not comprise any through connections; meanwhile, in order to realize the capability of extracting the high-performance fixed-depth sub-network, a path abandoning method is adopted to regularize the rule of cooperative adaptation of the sub-paths in the fractal architecture; for fractal networks, the simplicity of training corresponds to the simplicity of design, with a single loss function connected to the last layer sufficient to drive internal behavior to mimic deep supervision; the adopted fractal network is a deep convolutional neural network based on a fractal structure.
5. The method for describing videos based on two-way fractal network and LSTM according to claim 4, wherein the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal, specifically:
the base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types; wherein the base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
the following fractal is defined recursively, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
in equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$; the join operation $\oplus$ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor of activations over a fixed number of channels in a spatial region; the number of channels corresponds to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single connection layer; the connection layer merges all of its input feature blocks into a single output block.
6. The video description method according to claim 4, wherein the regularized adaptation rule of the sub-paths in the fractal architecture by using one of the path discarding methods in steps S2.1 and S2.2 is specifically: because the fractal network comprises an additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used, and paths abandon that the common adaptation of parallel paths is forbidden by randomly discarding operands of a connection layer, so that the mode effectively prevents the network from using one path as an anchor and the other path as a correction to possibly cause over-fitting behavior; two sampling strategies were employed:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
7. The method for describing a video based on a two-way fractal network and an LSTM according to claim 1, wherein the step S3 of inputting the high-level feature vector to two recurrent neural network models based on LSTM units specifically includes: the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
8. The video description method based on two-way fractal network and LSTM of claim 7, wherein the neural network model structure in step S3 is:
realizing the conversion into natural language text based on a recurrent neural network of two layers of LSTM units; at each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task; since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step; for a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
9. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S4 is specifically operated to perform weighted average on the output values of the two neural network independent models at each moment and obtain the description sentence corresponding to the video:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, the occurrence probability of each word in the vocabulary V is calculated with a softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary;
and S4.3, in the decoding stage of each moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934352A true CN106934352A (en) | 2017-07-07 |
Family
ID=59424160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710111507.8A Pending CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934352A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Based on memory unit reinforcing-time-series dynamics study Activity recognition method |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
-
2017
- 2017-02-28 CN CN201710111507.8A patent/CN106934352A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
Non-Patent Citations (4)
Title |
---|
GUSTAV LARSSON ET AL.: "FractalNet:Ultra-Deep Neural Networks without Residuals", 《ARXIV:1605.07648V2》 * |
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets:Deep Networks for Videos Classification", 《IEEE》 * |
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", 《ARXIV:1406.2199V2》 * |
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", 《ARXIV :1604.01729V 1》 * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN110019952B (en) * | 2017-09-30 | 2023-04-18 | 华为技术有限公司 | Video description method, system and device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108235116B (en) * | 2017-12-27 | 2020-06-16 | 北京市商汤科技开发有限公司 | Feature propagation method and apparatus, electronic device, and medium |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108470212B (en) * | 2018-01-31 | 2020-02-21 | 江苏大学 | Efficient LSTM design method capable of utilizing event duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN108536735B (en) * | 2018-03-05 | 2020-12-15 | 中国科学院自动化研究所 | Multi-mode vocabulary representation method and system based on multi-channel self-encoder |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109522451B (en) * | 2018-12-13 | 2024-02-27 | 连尚(新昌)网络科技有限公司 | Repeated video detection method and device |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN109785336B (en) * | 2018-12-18 | 2020-11-27 | 深圳先进技术研究院 | Image segmentation method and device based on multipath convolutional neural network model |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Based on memory unit reinforcing-time-series dynamics study Activity recognition method |
CN109753897B (en) * | 2018-12-21 | 2022-05-27 | 西北工业大学 | Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning |
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method and device, storage medium and electronic equipment |
CN110197195B (en) * | 2019-04-15 | 2022-12-23 | 深圳大学 | Novel deep network system and method for behavior recognition |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934352A (en) | A kind of video presentation method based on two-way fractal net work and LSTM | |
CN111985245B (en) | Relationship extraction method and system based on attention cycle gating graph convolution network | |
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN109934261B (en) | Knowledge-driven parameter propagation model and few-sample learning method thereof | |
CN113487088A (en) | Traffic prediction method and device based on dynamic space-time diagram convolution attention model | |
Liu et al. | Time series prediction based on temporal convolutional network | |
CN116415654A (en) | Data processing method and related equipment | |
CN111914085A (en) | Text fine-grained emotion classification method, system, device and storage medium | |
CN113535953B (en) | Meta learning-based few-sample classification method | |
CN112115744B (en) | Point cloud data processing method and device, computer storage medium and electronic equipment | |
CN112529071B (en) | Text classification method, system, computer equipment and storage medium | |
CN109583659A (en) | User's operation behavior prediction method and system based on deep learning | |
CN114925205B (en) | GCN-GRU text classification method based on contrast learning | |
Khoshraftar et al. | Dynamic graph embedding via lstm history tracking | |
Feng et al. | A survey of visual neural networks: current trends, challenges and opportunities | |
Srinivas et al. | A comprehensive survey of techniques, applications, and challenges in deep learning: A revolution in machine learning | |
CN116663523B (en) | Semantic text similarity calculation method for multi-angle enhanced network | |
CN115761654B (en) | Vehicle re-identification method | |
Zhou et al. | What happens next? Combining enhanced multilevel script learning and dual fusion strategies for script event prediction | |
Zhu | A graph neural network-enhanced knowledge graph framework for intelligent analysis of policing cases | |
CN111259673A (en) | Feedback sequence multi-task learning-based law decision prediction method and system | |
CN116050523A (en) | Attention-directed enhanced common sense reasoning framework based on mixed knowledge graph | |
Nagrath et al. | A comprehensive E-commerce customer behavior analysis using convolutional methods | |
Mai et al. | From Efficient Multimodal Models to World Models: A Survey | |
CN112528015B (en) | Method and device for judging rumor in message interactive transmission |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170707 |
RJ01 | Rejection of invention patent application after publication |