CN105279495A - Video description method based on deep learning and text summarization - Google Patents

Video description method based on deep learning and text summarization

Info

Publication number
CN105279495A
CN105279495A
Authority
CN
China
Prior art keywords
video
description
neural network
network model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495B (en)
Inventor
李广
马书博
韩亚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wellthinker Automation Technology Co ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video description method based on deep learning and text summarization. The method comprises the following steps: training a convolutional neural network model on a conventional image data set according to an image classification task; extracting the video frame sequence of a video and extracting convolutional neural network features with the model to form <video frame sequence, text description sequence> pairs, which serve as the input of a recurrent neural network model that is then trained; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences; and ranking the description sequences by graph-based lexical centrality as salience in text summarization and outputting the final description of the video. The events occurring in a video and the object attributes associated with those events are described in natural language, so that the video content is described and summarized.

Description

A video description method based on deep learning and text summarization

Technical Field

The invention relates to the field of video description, and in particular to a video description method based on deep learning and text summarization.

Background Art

Describing a video in natural language is extremely important, both for understanding the video and for retrieving it on the Web. At the same time, language description of video is a key research topic in multimedia and computer vision. Video description means that, for a given video, features are extracted by observing its content, and corresponding sentences are generated from that content. When people watch a video, especially an action video, they gain some understanding of it and can describe in language what happened in it, for example with a sentence such as "A man is riding a motorcycle." However, faced with a large number of videos, describing them manually one by one requires substantial time, manpower, and money. It is therefore necessary to analyze video features with computer techniques and combine them with natural language processing methods to generate video descriptions. On the one hand, video description allows people to understand videos more precisely from a semantic perspective. On the other hand, in video retrieval, retrieving the corresponding video from a textual description entered by a user is very difficult and remains a challenge.

In the past few years a variety of video description methods have emerged. For example, by analyzing video features, the objects present in a video and the action relations between them can be recognized. A fixed language template, subject + verb + object, is then applied: the subject and object are chosen from the recognized objects and the action relation between them serves as the predicate, and sentences describing the video are generated in this way.

However, such methods have certain limitations. Sentences generated from language templates tend to have a fixed, monotonous structure and lack the expressiveness of natural human language. Moreover, recognizing the objects and actions in a video requires different features for each subtask, which makes the pipeline cumbersome and demands a great deal of time for training on video features. In addition, recognition accuracy directly determines the quality of the generated sentences; such a step-by-step method must guarantee high accuracy at every stage, which is difficult to achieve.

Summary of the Invention

The present invention provides a video description method based on deep learning and text summarization. The invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content, as detailed below.

A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:

downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;

training a convolutional neural network model on an existing image data set according to an image classification task;

extracting a video frame sequence from each video and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs that serve as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;

describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences;

ranking the description sequences by graph-based lexical centrality as salience in text summarization, and outputting the final description of the video.

Forming the text description training set by downloading videos from the Internet and describing each video as <video, description> pairs specifically comprises:

composing <video, description> pairs from an existing video collection and the sentence descriptions corresponding to each video, which constitute the text description training set.

The step of extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, and forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model to train it specifically comprises:

extracting the convolutional neural network features of the images with the parameters of the trained convolutional neural network model, and modeling them together with the sentence descriptions of the images to obtain an objective function;

constructing a recurrent neural network, in which the nonlinear function is modeled by a long short-term memory network;

optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.

The step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:

extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;

taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.

The technical solution provided by the invention has the following beneficial effects. Each video consists of a sequence of frames, and a convolutional neural network extracts the low-level features of every frame; this effectively avoids the excessive noise that conventional deep-learning video feature extraction introduces, which would reduce the accuracy of the sentences generated later. A trained recurrent neural network converts each frame into a sentence, producing a set of sentences; automatic text summarization then computes the centrality among the sentences and selects high-quality, representative ones from the set as the description of the video. The method thus produces better video descriptions in terms of accuracy and sentence diversity. Furthermore, the approach based on deep learning and text summarization can be effectively extended to video retrieval applications, although the present method is limited to English descriptions of video content.

Brief Description of the Drawings

Fig. 1 is a flow chart of a video description method based on deep learning and text summarization.

Fig. 2 is a schematic diagram of the convolutional neural network (CNN) model used in the present invention,

in which Cov denotes a convolution kernel; ReLU denotes the function max(0, x); Pool denotes the pooling operation; LRN is the local response normalization operation; and Softmax is the objective function.

Fig. 3 is a schematic diagram of the recurrent neural network used in the present invention,

in which $x_t$ denotes the input at step t; $h_{t-1}$ denotes the hidden state of the previous step; i is the input gate; f is the forget gate; o is the output gate; c is the cell; and $m_t$ is the output after passing through one LSTM unit.

Fig. 4(a) is the initial fully connected LexRank graph,

in which $S = \{S_1, \ldots, S_{10}\}$ are ten sentences generated by the recurrent neural network (RNN), represented in graph form as ten nodes; the similarity between nodes is represented by edges forming a fully connected graph, with the thickness of an edge indicating the magnitude of the similarity.

Fig. 4(b) is the LexRank graph after pruning:

by setting a threshold, edges with low inter-node similarity are removed, so that the remaining edges connect nodes, i.e. sentences, with high mutual similarity.

Fig. 5 is a schematic diagram of the sentences generated for some of the video frames,

in which below each frame is the sentence generated by the CNN-RNN model used in the present invention, and the part indicated by the arrow is the summary of the video's text descriptions produced by the LexRank method, which serves as the text description of the video.

Detailed Description of the Embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.

Given the problems described in the background art, and inspired by the marked improvements achieved by applying deep learning to image description, deep learning methods have been applied to video, improving the diversity and correctness of the generated video descriptions to some extent.

To this end, an embodiment of the present invention proposes a video description method based on deep learning and text summarization. First, the method extracts the visual features of each video frame with a convolutional neural network framework. Each visual feature is then fed into a recurrent neural network framework, which generates one descriptive sentence per visual feature, i.e. per video frame. This yields a set of sentences. To obtain the most expressive, high-quality sentence as the description of the video, the method applies text summarization, ranking all the sentences by computing inter-sentence similarity, so that erroneous or low-quality sentences are kept out of the final description of the video. Automatic text summarization not only yields a representative sentence but does so with a degree of correctness and reliability, improving the accuracy of the video description. The method also overcomes some of the technical difficulties facing video retrieval.

Embodiment 1

A video description method based on deep learning and text summarization; referring to Fig. 1, the method comprises the following steps (a minimal pipeline sketch follows this list):

101: download videos from the Internet and describe each video (in English) to form <video, description> pairs, which constitute the text description training set; each video corresponds to multiple descriptive sentences and thus to a text description sequence;

102: train a convolutional neural network (CNN) model on an existing image data set according to an image classification task,

for example ImageNet;

103: extract a video frame sequence from each video and extract CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and train to obtain the RNN model;

104: describe the video frame sequence of the video to be described with the trained RNN model to obtain description sequences;

105: rank the plausibility of the description sequences by LexRank, i.e. graph-based lexical centrality as salience in text summarization, and select the most plausible description as the final description of the video.

In summary, through steps 101 to 105 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content.
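For orientation, steps 101 to 105 can be read as the following minimal Python sketch. Every callable here (frame extraction, CNN feature extraction, RNN captioning, LexRank ranking) is a hypothetical placeholder for the components detailed in Embodiment 2, not an API defined by the patent.

```python
def describe_video(video, extract_frames, cnn_features, rnn_describe, lexrank_rank):
    """Steps 101-105 as one pipeline; every callable is a placeholder."""
    frames = extract_frames(video, 10)                           # step 103: frame sequence
    sentences = [rnn_describe(cnn_features(f)) for f in frames]  # steps 103-104: one sentence per frame
    ranked = lexrank_rank(sentences)                             # step 105: rank by lexical centrality
    return ranked[0]                                             # most salient sentence as the final description
```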

Embodiment 2

201: download videos from the Internet and describe each video to form <video, description> pairs, which constitute the text description training set.

This step specifically comprises:

(1) Download the Microsoft Research Video Description Corpus from the Internet. This data set comprises 1970 video clips collected from YouTube and can be expressed as $VID = \{Video_1, \ldots, Video_{N_d}\}$, where $N_d$ is the total number of videos in the collection VID.

(2) Each video has multiple corresponding descriptions; the sentence descriptions of a video are $Sentences = \{Sentence_1, \ldots, Sentence_N\}$, where N is the number of descriptions ($Sentence_1, \ldots, Sentence_N$) corresponding to the video.

(3) Compose <video, description> pairs from the existing video collection VID and the sentence descriptions Sentences corresponding to each video, constituting the text description training set.

202: train a convolutional neural network (CNN) model on an existing image data set according to an image classification task, training the CNN model parameters.

This step specifically comprises:

(1) Construct the AlexNet [1] CNN model shown in Fig. 2: the model comprises eight network layers, of which the first five are convolutional layers and the last three are fully connected layers.

(2) Use ImageNet as the training set, resampling every picture in the image data set to 256×256, with $IMAGE = \{Image_1, \ldots, Image_{N_m}\}$ as input, where $N_m$ is the number of pictures. With the network layers configured as in Fig. 2, the first layer can be expressed as

$$F_1(\mathrm{IMAGE}) = \mathrm{norm}\{\mathrm{pool}[\max(0,\, W_1 * \mathrm{IMAGE} + B_1)]\} \qquad (1)$$

where IMAGE denotes the input image; $W_1$ the convolution kernel parameters; $B_1$ the bias; $F_1(\mathrm{IMAGE})$ the output after the first network layer; and norm the normalization operation. In this layer the convolved image is processed by the rectified linear function $\max(0, x)$ with $x = W_1 * \mathrm{IMAGE} + B_1$, followed by the pooling operation and local response normalization (LRN), where the normalization is

$$b_i^{x,y} = a_i^{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(M-1,\, i+n/2)} \big( a_j^{x,y} \big)^2 \Big)^{\beta} \qquad (2)$$

where M is the number of feature maps after pooling; i indexes the i-th of the M feature maps; n is the local normalization size, i.e. normalization is performed over every n feature maps; $a_i^{x,y}$ is the value at coordinate (x, y) in the i-th feature map; k is a bias; α and β are normalization parameters; and $b_i^{x,y}$ is the output after local response normalization (LRN).

In AlexNet, k = 2, n = 5, α = 10⁻⁴, and β = 0.75.
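As a concrete illustration of eq. (2), the following NumPy sketch applies LRN across feature maps with the AlexNet constants; the (M, H, W) array layout is an assumption made here for clarity.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Eq. (2): normalize each map a_i by a sum over up to n neighboring maps.
    `a` has shape (M, H, W): M feature maps of spatial size H x W."""
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        lo, hi = max(0, i - n // 2), min(M - 1, i + n // 2)       # inclusive window bounds
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```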

Continuing with this model, $F_1(\mathrm{IMAGE})$ is taken as the input of the second network layer, which can be expressed as

$$F_2(\mathrm{IMAGE}) = \max(0,\, W_2 * F_1(\mathrm{IMAGE}) + B_2) \qquad (3)$$

where $W_2$ denotes the convolution kernel parameters; $B_2$ the bias; and $F_2(\mathrm{IMAGE})$ the output after the second network layer. The second layer is configured like the first; only the sizes of the convolution and pooling kernels change.

Following the AlexNet network settings, the remaining convolutional layers can be expressed in turn as

$$F_3(\mathrm{IMAGE}) = \max(0,\, W_3 * F_2(\mathrm{IMAGE}) + B_3) \qquad (4)$$

$$F_4(\mathrm{IMAGE}) = \max(0,\, W_4 * F_3(\mathrm{IMAGE}) + B_4) \qquad (5)$$

$$F_5(\mathrm{IMAGE}) = \mathrm{pool}[\max(0,\, W_5 * F_4(\mathrm{IMAGE}) + B_5)] \qquad (6)$$

where $W_3, W_4, W_5$ and $B_3, B_4, B_5$ are the convolution parameters and biases of the respective layers.

The last three layers are fully connected and, per the network settings of Fig. 2, can be expressed in turn as

$$F_6(\mathrm{IMAGE}) = \mathrm{fc}[F_5(\mathrm{IMAGE}),\, \theta_1] \qquad (7)$$

$$F_7(\mathrm{IMAGE}) = \mathrm{fc}[F_6(\mathrm{IMAGE}),\, \theta_2] \qquad (8)$$

$$F_8(\mathrm{IMAGE}) = \mathrm{fc}[F_7(\mathrm{IMAGE}),\, \theta_3] \qquad (9)$$

where fc denotes a fully connected layer and $\theta_1, \theta_2, \theta_3$ are the parameters of the three fully connected layers; the last-layer features $F_8(\mathrm{IMAGE})$ are fed to a 1000-class multinomial classifier for classification.
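The eight layers of eqs. (1)-(9) can be sketched in PyTorch as below. The kernel sizes, strides, and pooling settings follow Embodiment 3; the channel widths (96/256/384/384/256), the paddings, and the 4096-unit fully connected layers are taken from the AlexNet paper [1] and are assumptions here, not values stated in the patent.

```python
import torch.nn as nn

alexnet = nn.Sequential(
    # layers 1-2: conv + ReLU + pool + LRN, eqs. (1) and (3)
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    # layers 3-5: conv + ReLU, eqs. (4)-(6), with pooling after layer 5
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # layers 6-8: fully connected, eqs. (7)-(9)
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),  # F8, fed to the 1000-way softmax classifier
)
```

A forward pass on one resampled picture, e.g. alexnet(torch.randn(1, 3, 256, 256)), yields the 1000-way scores $F_8(\mathrm{IMAGE})$ that the classifier of step (3) below consumes.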

(3) Given the current network, set up the multinomial classifier, whose formula can be expressed as

$$l(\Theta) = \sum_{t=1}^{m} \log p\big(y^{(t)} \mid x^{(t)};\, \Theta\big) \qquad (10)$$

where $l(\Theta)$ is the objective function; m is the number of image categories in ImageNet; $x^{(t)}$ is the CNN feature extracted for each category by the AlexNet network; $y^{(t)}$ is the label of each image; and $\Theta = \{W_p, B_p, \theta_q\}$, $p = 1, \ldots, 5$, $q = 1, 2, 3$, are the parameters of the respective network layers. The parameters of the objective function are optimized by gradient descent, yielding the parameters Θ of the AlexNet network.
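A minimal training-step sketch for eq. (10), continuing the `alexnet` module of the sketch above: PyTorch's CrossEntropyLoss is the negative of the per-batch log-likelihood $l(\Theta)$ (up to averaging), so minimizing it by (stochastic) gradient descent maximizes the objective. The learning-rate and momentum values are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

optimizer = optim.SGD(alexnet.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()  # negative log-likelihood of eq. (10), averaged over the batch

def train_step(images, labels):
    """One gradient-descent update of Θ on a batch of ImageNet samples:
    `images` is a (B, 3, 256, 256) tensor, `labels` a (B,) tensor of class ids."""
    optimizer.zero_grad()
    loss = criterion(alexnet(images), labels)
    loss.backward()   # gradients of the objective w.r.t. all layer parameters
    optimizer.step()
    return loss.item()
```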

203: extract a video frame sequence from each video and extract CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network (RNN) model, and train to obtain the RNN model.

This step specifically comprises:

(1) Using the training data of step 201 and the parameters of the CNN model trained in step 202, extract the CNN feature I of each image and model it together with the image's sentence description S; the objective function is

$$\theta^{*} = \arg\max_{\theta} \sum \log p(S \mid I;\, \theta) \qquad (11)$$

where (S, I) denotes an image-text pair in the training data; θ denotes the model parameters to be optimized; and θ* the optimized parameters.

The goal of training is to maximize, over all samples, the sum of the log-probabilities of the sentences generated given the observed input image I. The probability p(S|I; θ) is computed with the chain rule of conditional probability:

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, S_1, \ldots, S_{t-1}) \qquad (12)$$

where $S_0, S_1, \ldots, S_{t-1}, S_t$ denote the words of the sentence. The unknown quantity $p(S_t \mid I, S_0, S_1, \ldots, S_{t-1})$ in the formula is modeled with a recurrent neural network.
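The chain rule of eq. (12) amounts to summing per-word log-probabilities; a sketch follows, where `next_word_probs(I, prefix)` is a hypothetical callable wrapping the recurrent model of step (2) below.

```python
import numpy as np

def sentence_log_prob(next_word_probs, image_feature, words):
    """Eq. (12): log p(S|I) = sum_t log p(S_t | I, S_0, ..., S_{t-1})."""
    total = 0.0
    for t in range(len(words)):
        p = next_word_probs(image_feature, words[:t])  # distribution over the vocabulary
        total += np.log(p[words[t]])                   # log-probability of the t-th word
    return total
```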

(2) Construct the recurrent neural network (RNN):

Conditioned on the first t−1 words, the network represents these words as a fixed-length hidden state $h_t$; when a new input $x_t$ arrives, the hidden state is updated by a nonlinear function f:

$$h_{t+1} = f(h_t,\, x_t) \qquad (13)$$

where $h_{t+1}$ denotes the next hidden state.

(3) Model the nonlinear function f by constructing the long short-term memory network (LSTM) shown in Fig. 3,

where $i_t$ is the input gate, $f_t$ the forget gate, $o_t$ the output gate, and c the cell; the state updates and the output can be expressed as

$$i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1}) \qquad (14)$$

$$f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \qquad (15)$$

$$o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1}) \qquad (16)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \qquad (17)$$

$$m_t = o_t \odot c_t \qquad (18)$$

$$p_{t+1} = \mathrm{Softmax}(m_t) \qquad (19)$$

where ⊙ denotes the element-wise product between gate values; the matrices $W = \{W_{ix}; W_{im}; W_{fx}; W_{fm}; W_{ox}; W_{om}; W_{cx}; W_{cm}\}$ are the parameters to be trained; σ(·) is the sigmoid function (e.g. $\sigma(W_{ix} x_t + W_{im} m_{t-1})$ and $\sigma(W_{fx} x_t + W_{fm} m_{t-1})$ are sigmoid functions); h(·) is the hyperbolic tangent function (e.g. $h(W_{cx} x_t + W_{cm} m_{t-1})$); $p_{t+1}$ is the probability distribution over the next word after Softmax classification; and $m_t$ is the current state feature.
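Eqs. (14)-(19) translate directly into the following NumPy sketch of one LSTM step. The dictionary of weight matrices mirrors W above; the softmax projection matrix is an added assumption (the patent folds it into the Softmax step), and biases are omitted as in the formulation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM step per eqs. (14)-(19); W maps names to weight matrices."""
    i = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)                   # (14) input gate
    f = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)                   # (15) forget gate
    o = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)                   # (16) output gate
    c = f * c_prev + i * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)  # (17) cell update
    m = o * c                                                       # (18) state output
    logits = W["softmax"] @ m                                       # projection to the vocabulary
    e = np.exp(logits - logits.max())
    return m, c, e / e.sum()                                        # (19) p_{t+1}
```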

(4) Optimize the objective function (11) by gradient descent, obtaining the trained LSTM parameters W.

204: describe the video frame sequence of the video to be described with the trained RNN model to obtain the description sequence. The prediction steps are as follows:

(1) Extract the test set $VID_t = \{Video_t^1, \ldots, Video_t^{N_t}\}$, where $N_t$ is the number of test videos and the subscript t marks the test set, and extract 10 frames from each video, expressed as $Image_t = \{Image_t^1, \ldots, Image_t^{10}\}$.

(2) With the trained model parameters $\Theta = \{W_i, B_i, \theta_j\}$, $i = 1, \ldots, 5$, $j = 1, 2, 3$, use the CNN model to extract the CNN feature of each image in $Image_t$, obtaining the image features $I_t = \{I_t^1, \ldots, I_t^{10}\}$.

(3) Take the image features $I_t$ as input and evaluate formula (12) with the trained model parameters W, obtaining the sentence descriptions $S = \{S_1, \ldots, S_n\}$ and hence the sentence descriptions corresponding to the video.
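One plausible decoding loop for step (3) emits the most probable word at each step, greedily. `lstm_step` is the sketch above; the embedding table `embed`, the word list `vocab`, and the `<END>` token are assumptions about details the patent leaves unspecified.

```python
import numpy as np

def greedy_describe(feature, W, embed, vocab, max_len=20):
    """Generate one sentence from a frame's CNN feature with the trained LSTM."""
    dim = W["im"].shape[1]
    m, c = np.zeros(dim), np.zeros(dim)
    m, c, p = lstm_step(feature, m, c, W)        # the image feature is the first input
    words = []
    for _ in range(max_len):
        w = int(np.argmax(p))                    # most probable next word
        if vocab[w] == "<END>":
            break
        words.append(vocab[w])
        m, c, p = lstm_step(embed[w], m, c, W)   # feed the chosen word back in
    return " ".join(words)
```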

205: rank the plausibility of the description sequences with LexRank and select the most plausible description as the final description of the video.

(1) Run the RNN model on the video feature sequence $I_t = \{I_t^1, \ldots, I_t^{10}\}$ to generate the corresponding sentence set $S = \{S_1, \ldots, S_i, \ldots, S_n\}$.

(2) Generate sentence features. Scan, in order, all the words of every sentence $S_i$ in the sentence set S, keeping one copy of each distinct word, to form the vocabulary $VOL = \{w_1, \ldots, w_{N_w}\}$ represented as a word list, where $N_w$ is the total number of words in the vocabulary VOL. For each word $w_i$ in VOL, scan every sentence $S_j$ in S in order, count the number $n_{ij}$ of occurrences of $w_i$ in $S_j$, where $j = 1, \ldots, N_s$ and $N_s$ is the total number of sentences, and count the number $num(w_i)$ of sentences in S containing $w_i$. Then compute the term frequency $tf(w_i, s_j)$ of each word $w_i$ in each sentence $S_j$ by formula (20), for $i = 1, \ldots, N_w$ and $j = 1, \ldots, N_s$:

$$tf(w_i, s_j) = n_{ij} \Big/ \sum_{k=1}^{N_w} n_{kj} \qquad (20)$$

where $n_{kj}$ is the number of occurrences of the k-th word in the j-th sentence.

For each word $w_i$ in the vocabulary VOL, compute its inverse document frequency $idf(w_i)$ by formula (21):

$$idf(w_i) = \log\big(N_d / num(w_i)\big) \qquad (21)$$

where $N_d$ is the total number of sentences (treated as documents) in the set.

According to the vector space model, each sentence $S_j$ in the set S is represented as an $N_w$-dimensional vector whose i-th dimension corresponds to the word $w_i$ of the vocabulary, with value $tfidf(w_i)$ computed as

$$tfidf(w_i) = tf(w_i, s_j) \times idf(w_i) \qquad (22)$$
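Eqs. (20)-(22) correspond to the following sketch, which represents each sentence (given as a list of words) as a sparse tf-idf vector; treating each sentence as a "document" for eq. (21) is the reading adopted here.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one {word: tf*idf} vector per sentence, plus the idf table."""
    n_sent = len(sentences)
    df = Counter(w for s in sentences for w in set(s))   # num(w_i): sentences containing w_i
    idf = {w: math.log(n_sent / df[w]) for w in df}      # eq. (21)
    vectors = []
    for s in sentences:
        counts, total = Counter(s), len(s)
        vectors.append({w: (c / total) * idf[w] for w, c in counts.items()})  # eqs. (20), (22)
    return vectors, idf
```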

(3) Take the cosine between two sentence vectors $S_i$ and $S_j$ as the sentence similarity, computed as

$$similarity(S_i, S_j) = \frac{\sum_{w \in S_i, S_j} tf_{w,S_i}\, tf_{w,S_j}\, (idf_w)^2}{\sqrt{\sum_{s_m \in S_i} \big( tf_{s_m,S_i}\, idf_{s_m} \big)^2} \times \sqrt{\sum_{s_n \in S_j} \big( tf_{s_n,S_j}\, idf_{s_n} \big)^2}} \qquad (23)$$

where $tf_{w,S_i}$ is the frequency of word w in sentence $S_i$; $tf_{w,S_j}$ is the frequency of word w in sentence $S_j$; $idf_w$ is the inverse document frequency of word w; $s_m$ is any word in sentence $S_i$; $tf_{s_m,S_i}$ is the frequency of $s_m$ in $S_i$; $idf_{s_m}$ is the inverse document frequency of $s_m$; $s_n$ is any word in sentence $S_j$; $tf_{s_n,S_j}$ is the frequency of $s_n$ in $S_j$; and $idf_{s_n}$ is the inverse document frequency of $s_n$.

A fully connected undirected graph is then formed, as in Fig. 4(a), with each node $u_i$ standing for sentence $S_i$ and the edges between nodes weighted by sentence similarity (a sketch of this similarity follows).
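Eq. (23) is the idf-modified cosine. A sketch, using raw within-sentence counts for tf as the formula does; the full similarity matrix of the graph in Fig. 4(a) is obtained by evaluating it over all sentence pairs, with `idf` coming from the tf-idf sketch above.

```python
import math
from collections import Counter

def idf_cosine(s_i, s_j, idf):
    """Eq. (23): idf-modified cosine similarity of two word lists."""
    tf_i, tf_j = Counter(s_i), Counter(s_j)
    num = sum(tf_i[w] * tf_j[w] * idf.get(w, 0.0) ** 2 for w in tf_i.keys() & tf_j.keys())
    den_i = math.sqrt(sum((tf_i[w] * idf.get(w, 0.0)) ** 2 for w in tf_i))
    den_j = math.sqrt(sum((tf_j[w] * idf.get(w, 0.0)) ** 2 for w in tf_j))
    return num / (den_i * den_j) if den_i and den_j else 0.0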

(4) Set a threshold Degree and delete all edges whose similarity is less than Degree, as in Fig. 4(b).

(5) Compute the LexRank score LR of each sentence node $u_i$. The initial score of every sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, usually chosen in [0.1, 0.2]. The score LR is computed by formula (24):

$$LR(u) = \frac{d}{N} + (1 - d) \sum_{v \in adj[u]} \frac{LR(v)}{\deg(v)} \qquad (24)$$

where $\deg(v)$ is the degree of node v; LR(u) is the score of node u; and LR(v) is the score of node v.

(6) Compute and rank the LR score of every sentence node, and select the sentence with the highest score as the final description of the video.
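Steps (4)-(6) amount to pruning the similarity graph and iterating eq. (24) to a fixed point. A sketch, where the threshold value and the iteration count are assumptions (the patent fixes neither):

```python
def lexrank_scores(sim, threshold=0.1, d=0.15, iters=100):
    """Iterate LR(u) = d/N + (1-d) * sum_{v in adj[u]} LR(v)/deg(v), eq. (24).
    `sim` is the N x N sentence similarity matrix from eq. (23)."""
    N = len(sim)
    adj = [[j for j in range(N) if j != i and sim[i][j] >= threshold] for i in range(N)]
    deg = [max(len(a), 1) for a in adj]   # guard against isolated nodes
    lr = [d / N] * N                      # initial score of every node
    for _ in range(iters):
        lr = [d / N + (1 - d) * sum(lr[v] / deg[v] for v in adj[u]) for u in range(N)]
    return lr
```

The sentence whose node has the highest score in `lr` becomes the final description of the video, per step (6).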

In summary, through steps 201 to 205 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content.

Embodiment 3

Here two videos are selected as the videos to be described, as shown in Fig. 5, and the method of the present invention based on deep learning and text summarization is used to predict and output the corresponding video descriptions:

(1) Use ImageNet as the training set, resampling every picture in the data set to 256×256, with $IMAGE = \{Image_1, \ldots, Image_{N_m}\}$ as input, where $N_m$ is the number of pictures.

(2) Build the first convolutional layer: set the convolution kernel cov1 to size 11 with stride 4; choose ReLU, max(0, x); apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2; and normalize the convolved data with local response normalization. In AlexNet, k = 2, n = 5, α = 10⁻⁴, β = 0.75.

(3) Build the second convolutional layer: set the convolution kernel cov2 to size 5 with stride 1; choose ReLU, max(0, x); apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2; and normalize the convolved data with local response normalization.

(4) Build the third convolutional layer: set the convolution kernel cov3 to size 3 with stride 1, and choose ReLU, max(0, x).

(5) Build the fourth convolutional layer: set the convolution kernel cov4 to size 3 with stride 1, and choose ReLU, max(0, x).

(6) Build the fifth convolutional layer: set the convolution kernel cov5 to size 3 with stride 1, choose ReLU, max(0, x), and apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2.

(7) Build the sixth layer, the fully connected layer fc6; choose ReLU, max(0, x), and apply dropout to the processed data.

(8) Build the seventh layer, the fully connected layer fc7; choose ReLU, max(0, x), and apply dropout to the processed data.

(9) Build the eighth layer, the fully connected layer fc8, and add a Softmax classifier as the objective function.

(10) With the eight network layers above configured, the convolutional neural network (CNN) model is established.

(11) Train the CNN model parameters.

(12) Data processing: extract 10 frames uniformly from each video in the data set and resample them to 256×256. Feed the images into the trained CNN model to obtain image features; each frame is randomly paired with 5 of the video's descriptive sentences to form image-text pairs.

(13) Construct the recurrent neural network (RNN) model.

Fig. 5 shows the video text description results produced by the present invention. The pictures in the figure are video frames extracted from the videos, and the sentence accompanying each frame is the result obtained after the video features pass through the language model. The lower part of the figure shows, after summarization, the sentences generated using only the video features and image transfer, together with the videos' original descriptions.

In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through a convolutional neural network and a recurrent neural network, and selects high-quality, representative sentences from the many candidates through text summarization. With this method users can obtain video descriptions of high accuracy, and the method can be extended to video retrieval.

References

[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]. Advances in Neural Information Processing Systems, 2012.

Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (4)

1. A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:

downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;

training a convolutional neural network model on an existing image data set according to an image classification task;

extracting a video frame sequence from each video and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs that serve as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;

describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences;

ranking the description sequences by graph-based lexical centrality as salience in text summarization, and outputting the final description of the video.

2. The video description method based on deep learning and text summarization according to claim 1, characterized in that downloading videos from the Internet and describing each video to form <video, description> pairs constituting the text description training set specifically comprises:

composing <video, description> pairs from an existing video collection and the sentence descriptions corresponding to each video, which constitute the text description training set.

3. The video description method based on deep learning and text summarization according to claim 1, characterized in that the step of extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, and forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model to train it specifically comprises:

extracting the convolutional neural network features of the images with the parameters of the trained convolutional neural network model, and modeling them together with the sentence descriptions of the images to obtain an objective function;

constructing a recurrent neural network, in which the nonlinear function is modeled by a long short-term memory network;

optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.

4. The video description method based on deep learning and text summarization according to claim 1, characterized in that the step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:

extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;

taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.
CN201510697454.3A 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization Active CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Publications (2)

Publication Number Publication Date
CN105279495A true CN105279495A (en) 2016-01-27
CN105279495B CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A Active CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Cited By (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Statement recognition methods based on two-way LSTM neutral net and device
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 A kind of video and the determination method and device of Video Key word degree of association
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Generation method and device of natural language sentence library
CN106485251A (en) * 2016-10-08 2017-03-08 天津工业大学 Egg embryo classification based on deep learning
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image text description method based on knowledge transfer multi-modal recurrent neural network
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 A kind of video presentation generation method based on deep learning and probability graph model
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 A kind of method and system for realizing image switch labels
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 A kind of data automatic statistical analysis method
CN107292086A (en) * 2016-04-07 2017-10-24 西门子保健有限责任公司 Graphical analysis question and answer
CN107368887A (en) * 2017-07-25 2017-11-21 江西理工大学 A kind of structure and its construction method of profound memory convolutional neural networks
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 A kind of image processing method and system
CN107515900A (en) * 2017-07-24 2017-12-26 宗晖(上海)机器人有限公司 Intelligent robot and its event memorandum system and method
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 A Image Caption Method Based on Attribute Probability Vector Guided Attention Patterns
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Generated according to video data and explain data, data synthesis method and device, electronic equipment
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Forecasting Methodology, the device and system of destination object attribute
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model
CN107844751A (en) * 2017-10-19 2018-03-27 陕西师范大学 The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108307229A (en) * 2018-02-02 2018-07-20 新华智云科技有限公司 A kind of processing method and equipment of video-audio data
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 A kind of figure says generation method and device
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction technique and device, storage medium
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Official documents and correspondence generation method and device, storage medium and electronic device
CN109711022A (en) * 2018-12-17 2019-05-03 哈尔滨工程大学 A submarine anti-sinking system based on deep learning
CN109891897A (en) * 2016-10-27 2019-06-14 诺基亚技术有限公司 Method for analyzing media content
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A Method for Generating Image Semantic Description
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN110612537A (en) * 2017-05-02 2019-12-24 柯达阿拉里斯股份有限公司 System and method for batch normalized loop highway network
CN110678816A (en) * 2017-04-04 2020-01-10 西门子股份公司 Method and control device for controlling a technical system
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 A video object localization method based on weakly supervised learning and video spatiotemporal features
CN110781345A (en) * 2019-10-31 2020-02-11 北京达佳互联信息技术有限公司 Video description generation model acquisition method, video description generation method and device
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN111404676A (en) * 2020-03-02 2020-07-10 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secure and secret key and cipher text
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 A video annotation method based on deep learning
CN111461974A (en) * 2020-02-17 2020-07-28 天津大学 Image scanning path control method based on L STM model from coarse to fine
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system and device for identifying and constructing audio frequency by video object and readable storage medium
WO2020220702A1 (en) * 2019-04-29 2020-11-05 北京三快在线科技有限公司 Generation of natural language
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
WO2021056750A1 (en) * 2019-09-29 2021-04-01 北京市商汤科技开发有限公司 Search method and device, and storage medium
CN113191262A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN119011953A (en) * 2024-09-14 2024-11-22 广州九微信息科技有限公司 Video-on-demand and audio-frequency service system and method based on cloud computing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUNES ERKAN: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research *
SUBHASHINI VENUGOPALAN et al.: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Computer Science *

Cited By (107)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437929B2 (en) 2016-03-31 2019-10-08 Maluuba Inc. Method and system for processing an input query using a forward and a backward neural network specific to unigrams
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
CN107292086A (en) * 2016-04-07 2017-10-24 西门子保健有限责任公司 Image analysis question answering
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 Image processing method and system
CN107391505B (en) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 Image processing method and system
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Sentence recognition method and device based on bidirectional LSTM neural network
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition method and device based on bidirectional LSTM neural network
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 Method and device for determining the correlation between a video and video keywords
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 Method and device for determining the correlation between a video and video keywords
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Method and device for generating commentary data from video data, data synthesis method and device, and electronic device
CN106372107B (en) * 2016-08-19 2020-01-17 中兴通讯股份有限公司 Method and device for generating natural language sentence library
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Method and device for generating a natural language sentence library
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 Method for generating image descriptions from structured text
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 Method for generating image descriptions from structured text
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Classification of egg embryos based on deep learning
CN106485251A (en) * 2016-10-08 2017-03-08 天津工业大学 Egg embryo classification based on deep learning
CN109891897B (en) * 2016-10-27 2021-11-05 诺基亚技术有限公司 Method for analyzing media content
US11068722B2 (en) 2016-10-27 2021-07-20 Nokia Technologies Oy Method for analysing media content to generate reconstructed media content
CN109891897A (en) * 2016-10-27 2019-06-14 诺基亚技术有限公司 Method for analyzing media content
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on depth LSTM network
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on long short-term memory networks and convolutional neural networks
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 An image description method based on multi-level connection recurrent neural network
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method based on multi-level connected recurrent neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image-text description method based on knowledge-transfer multimodal recurrent neural network
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image-text description method based on knowledge-transfer multimodal recurrent neural network
CN106845411B (en) * 2017-01-19 2020-06-30 清华大学 Video description generation method based on deep learning and probabilistic graphical model
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 Video description generation method based on deep learning and probabilistic graphical model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 Video description method based on bidirectional fractal network and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 Video fingerprinting algorithm based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 Video content description method guided by semantic information
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Image caption generation method and device
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 Image caption generation method and device
US10983485B2 (en) 2017-04-04 2021-04-20 Siemens Aktiengesellschaft Method and control device for controlling a technical system
CN110678816A (en) * 2017-04-04 2020-01-10 西门子股份公司 Method and control device for controlling a technical system
CN110678816B (en) * 2017-04-04 2021-02-19 西门子股份公司 Method and control device for controlling a technical system
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction method and device, and storage medium
CN110612537A (en) * 2017-05-02 2019-12-24 柯达阿拉里斯股份有限公司 System and method for batch-normalized recurrent highway networks
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 Method and system for converting images into labels
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 Method, apparatus and electronic device for implementing image-text matching
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 Automatic statistical data analysis method
CN107515900A (en) * 2017-07-24 2017-12-26 宗晖(上海)机器人有限公司 Intelligent robot and event memo system and method thereof
CN107368887A (en) * 2017-07-25 2017-11-21 江西理工大学 Structure of a deep-memory convolutional neural network and construction method thereof
CN107368887B (en) * 2017-07-25 2020-08-07 江西理工大学 A device for deep memory convolutional neural network and its construction method
US11481625B2 (en) 2017-08-04 2022-10-25 Nokia Technologies Oy Artificial neural network
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 Image captioning method based on attribute-probability-vector-guided attention patterns
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 Human close-proximity action recognition method and device, storage medium, and electronic device
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN109522531B (en) * 2017-09-18 2023-04-07 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video description method, system and device
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN107844751A (en) * 2017-10-19 2018-03-27 陕西师范大学 Classification method for hyperspectral remote sensing images based on guided filtering and long short-term memory neural network
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Classification method for hyperspectral remote sensing images based on guided filtering and long short-term memory neural network
CN107818306B (en) * 2017-10-31 2020-08-07 天津大学 Video question-answering method based on attention model
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 Video question answering method based on attention model
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamic multimodal video description generation method
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 A dynamic multimodal video description generation method
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 Code annotation classification method based on neural network model
CN108307229B (en) * 2018-02-02 2023-12-22 新华智云科技有限公司 Video and audio data processing method and device
CN108307229A (en) * 2018-02-02 2018-07-20 新华智云科技有限公司 Video and audio data processing method and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN108765383B (en) * 2018-03-22 2022-03-18 山西大学 Video description method based on deep transfer learning
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video description method based on deep transfer learning
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 Chinese image semantic description method based on residual-connected Inception network fused with multi-layer GRU
CN108881950B (en) * 2018-05-30 2021-05-25 北京奇艺世纪科技有限公司 Video processing method and device
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 Video processing method and apparatus
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 Video processing method and apparatus
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Duplicate video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Duplicate video detection method and device
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109711022A (en) * 2018-12-17 2019-05-03 哈尔滨工程大学 A submarine anti-sinking system based on deep learning
CN109960747B (en) * 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 Video description information generation method, video processing method, and corresponding devices
US11861886B2 (en) 2019-04-02 2024-01-02 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating video description information, and method and apparatus for video processing
WO2020220702A1 (en) * 2019-04-29 2020-11-05 北京三快在线科技有限公司 Generation of natural language
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 An adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 Image semantic description generation method
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 Adaptive generation system for image semantic description
WO2021056750A1 (en) * 2019-09-29 2021-04-01 北京市商汤科技开发有限公司 Search method and device, and storage medium
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 A video object localization method based on weakly supervised learning and video spatiotemporal features
CN110765921B (en) * 2019-10-18 2022-04-19 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN110781345A (en) * 2019-10-31 2020-02-11 北京达佳互联信息技术有限公司 Video description generation model acquisition method, video description generation method and device
CN111461974A (en) * 2020-02-17 2020-07-28 天津大学 Image scanning path control method based on coarse-to-fine LSTM model
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on coarse-to-fine LSTM model
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 A video annotation method based on deep learning
CN111404676B (en) * 2020-03-02 2023-08-29 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret key and ciphertext
CN111404676A (en) * 2020-03-02 2020-07-10 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret keys and ciphertext
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph convolutional network
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for identifying video objects and constructing audio
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for identifying video objects and constructing audio
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN111931690B (en) * 2020-08-28 2024-08-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN113191262A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting text into video
CN119011953A (en) * 2024-09-14 2024-11-22 广州九微信息科技有限公司 Video and audio on-demand service system and method based on cloud computing

Also Published As

Publication number Publication date
CN105279495B (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN105279495B (en) A video description method based on deep learning and text summarization
CN106503055B (en) Method for generating image descriptions from structured text
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN109543084B (en) Method for building a hidden sensitive text detection model for online social media
CN110019839B (en) Method and system for constructing medical knowledge graph based on neural network and remote supervision
CN112270196B (en) Entity relationship identification method and device and electronic equipment
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
CN106682192B (en) A method and device for training an answer intent classification model based on search keywords
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN104142995B (en) Social event recognition method based on visual attributes
CN106126619A (en) Video retrieval method and system based on video content
CN111814477B (en) Dispute focus discovery method, device and terminal based on dispute focus entity
CN108280057A (en) Microblog rumor detection method based on BLSTM
CN114661872B (en) A beginner-oriented API adaptive recommendation method and system
CN110377778A (en) Image ranking method, device and electronic device based on title-image correlation
CN114818724B (en) A method for constructing an effective disaster information detection model on social media
CN113343690A (en) Text readability automatic evaluation method and device
CN117033558A (en) Movie review sentiment analysis method fusing BERT-WWM and multiple features
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN105975497A (en) Automatic microblog topic recommendation method and device
CN110046353A (en) An Aspect-Level Sentiment Analysis Method Based on Multilingual Hierarchical Mechanism
CN114579741A (en) GCN-RN aspect-level sentiment analysis method and system incorporating syntactic information
CN110110137A (en) Method and device for determining music characteristics, electronic equipment and storage medium
CN106599824A (en) GIF cartoon emotion identification method based on emotion pairs
CN115775349A (en) False news detection method and device based on multi-mode fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220322

Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong

Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD.

Address before: 300072 No. 92, Weijin Road, Nankai District, Tianjin

Patentee before: Tianjin University