CN105279495A - Video description method based on deep learning and text summarization - Google Patents

Video description method based on deep learning and text summarization

Info

Publication number
CN105279495A
CN105279495A
Authority
CN
China
Prior art keywords
video
description
neural network
network model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495B (en)
Inventor
李广
马书博
韩亚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wellthinker Automation Technology Co ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video description method based on deep learning and text summarization. The method comprises the following steps: training a convolutional neural network model on a conventional image data set according to an image classification task; extracting the video frame sequence of a video and extracting convolutional neural network features with the model to form <video frame sequence, text description sequence> pairs, which serve as the input of a recurrent neural network model that is then trained; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences; and ranking the description sequences by graph-based lexical centrality as salience in text summarization and outputting the final description of the video. The events occurring in a video and the object attributes associated with those events are described in natural language, so that the video content is described and summarized.

Description

A video description method based on deep learning and text summarization

Technical Field

The invention relates to the field of video description, and in particular to a video description method based on deep learning and text summarization.

Background Art

Describing a video in natural language is extremely important, both for understanding the video and for retrieving it on the Web. At the same time, language description of video is a key research topic in multimedia and computer vision. Video description means that, for a given video, features are extracted by observing its content, and corresponding sentences are generated from that content. When people watch a video, especially an action video, they gain some understanding of it and can describe in language what happened in it, for example with a sentence such as "A man is riding a motorcycle." However, faced with a large number of videos, describing them manually one by one requires substantial time, manpower, and money. It is therefore necessary to analyze video features with computer techniques and combine them with natural language processing methods to generate video descriptions. On the one hand, video description allows people to understand videos more precisely from a semantic perspective. On the other hand, in video retrieval, retrieving the corresponding video from a textual description entered by a user is very difficult and remains a challenge.

In the past few years a variety of video description methods have emerged. For example, by analyzing video features, the objects present in a video and the action relations between them can be recognized. A fixed language template, subject + verb + object, is then applied: the subject and object are chosen from the recognized objects and the action relation between them serves as the predicate, and sentences describing the video are generated in this way.

However, such methods have certain limitations. Sentences generated from language templates tend to have a fixed, monotonous structure and lack the expressiveness of natural human language. Moreover, recognizing the objects and actions in a video requires different features for each subtask, which makes the pipeline cumbersome and demands a great deal of time for training on video features. In addition, recognition accuracy directly determines the quality of the generated sentences; such a step-by-step method must guarantee high accuracy at every stage, which is difficult to achieve.

Summary of the Invention

The present invention provides a video description method based on deep learning and text summarization. The invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content, as detailed below.

A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:

downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;

training a convolutional neural network model on an existing image data set according to an image classification task;

extracting a video frame sequence from each video and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs that serve as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;

describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences;

ranking the description sequences by graph-based lexical centrality as salience in text summarization, and outputting the final description of the video.

Forming the text description training set by downloading videos from the Internet and describing each video as <video, description> pairs specifically comprises:

composing <video, description> pairs from an existing video collection and the sentence descriptions corresponding to each video, which constitute the text description training set.

The step of extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, and forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model to train it specifically comprises:

extracting the convolutional neural network features of the images with the parameters of the trained convolutional neural network model, and modeling them together with the sentence descriptions of the images to obtain an objective function;

constructing a recurrent neural network, in which the nonlinear function is modeled by a long short-term memory network;

optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.

The step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:

extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;

taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.

The technical solution provided by the invention has the following beneficial effects. Each video consists of a sequence of frames, and a convolutional neural network extracts the low-level features of every frame; this effectively avoids the excessive noise that conventional deep-learning video feature extraction introduces, which would reduce the accuracy of the sentences generated later. A trained recurrent neural network converts each frame into a sentence, producing a set of sentences; automatic text summarization then computes the centrality among the sentences and selects high-quality, representative ones from the set as the description of the video. The method thus produces better video descriptions in terms of accuracy and sentence diversity. Furthermore, the approach based on deep learning and text summarization can be effectively extended to video retrieval applications, although the present method is limited to English descriptions of video content.

Brief Description of the Drawings

Fig. 1 is a flow chart of a video description method based on deep learning and text summarization.

Fig. 2 is a schematic diagram of the convolutional neural network (CNN) model used in the present invention,

in which Cov denotes a convolution kernel; ReLU denotes the function max(0, x); Pool denotes the pooling operation; LRN is the local response normalization operation; and Softmax is the objective function.

Fig. 3 is a schematic diagram of the recurrent neural network used in the present invention,

in which $x_t$ denotes the input at step t; $h_{t-1}$ denotes the hidden state of the previous step; i is the input gate; f is the forget gate; o is the output gate; c is the cell; and $m_t$ is the output after passing through one LSTM unit.

Fig. 4(a) is the initial fully connected LexRank graph,

in which $S = \{S_1, \ldots, S_{10}\}$ are ten sentences generated by the recurrent neural network (RNN), represented in graph form as ten nodes; the similarity between nodes is represented by edges forming a fully connected graph, with the thickness of an edge indicating the magnitude of the similarity.

Fig. 4(b) is the LexRank graph after pruning:

by setting a threshold, edges with low inter-node similarity are removed, so that the remaining edges connect nodes, i.e. sentences, with high mutual similarity.

Fig. 5 is a schematic diagram of the sentences generated for some of the video frames,

in which below each frame is the sentence generated by the CNN-RNN model used in the present invention, and the part indicated by the arrow is the summary of the video's text descriptions produced by the LexRank method, which serves as the text description of the video.

Detailed Description of the Embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below.

Given the problems described in the background art, and inspired by the marked improvements achieved by applying deep learning to image description, deep learning methods have been applied to video, improving the diversity and correctness of the generated video descriptions to some extent.

To this end, an embodiment of the present invention proposes a video description method based on deep learning and text summarization. First, the method extracts the visual features of each video frame with a convolutional neural network framework. Each visual feature is then fed into a recurrent neural network framework, which generates one descriptive sentence per visual feature, i.e. per video frame. This yields a set of sentences. To obtain the most expressive, high-quality sentence as the description of the video, the method applies text summarization, ranking all the sentences by computing inter-sentence similarity, so that erroneous or low-quality sentences are kept out of the final description of the video. Automatic text summarization not only yields a representative sentence but does so with a degree of correctness and reliability, improving the accuracy of the video description. The method also overcomes some of the technical difficulties facing video retrieval.

Embodiment 1

A video description method based on deep learning and text summarization; referring to Fig. 1, the method comprises the following steps (a minimal pipeline sketch follows this list):

101: download videos from the Internet and describe each video (in English) to form <video, description> pairs, which constitute the text description training set; each video corresponds to multiple descriptive sentences and thus to a text description sequence;

102: train a convolutional neural network (CNN) model on an existing image data set according to an image classification task,

for example ImageNet;

103: extract a video frame sequence from each video and extract CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and train to obtain the RNN model;

104: describe the video frame sequence of the video to be described with the trained RNN model to obtain description sequences;

105: rank the plausibility of the description sequences by LexRank, i.e. graph-based lexical centrality as salience in text summarization, and select the most plausible description as the final description of the video.

In summary, through steps 101 to 105 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content.
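For orientation, steps 101 to 105 can be read as the following minimal Python sketch. Every callable here (frame extraction, CNN feature extraction, RNN captioning, LexRank ranking) is a hypothetical placeholder for the components detailed in Embodiment 2, not an API defined by the patent.

```python
def describe_video(video, extract_frames, cnn_features, rnn_describe, lexrank_rank):
    """Steps 101-105 as one pipeline; every callable is a placeholder."""
    frames = extract_frames(video, 10)                           # step 103: frame sequence
    sentences = [rnn_describe(cnn_features(f)) for f in frames]  # steps 103-104: one sentence per frame
    ranked = lexrank_rank(sentences)                             # step 105: rank by lexical centrality
    return ranked[0]                                             # most salient sentence as the final description
```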

Embodiment 2

201: download videos from the Internet and describe each video to form <video, description> pairs, which constitute the text description training set.

This step specifically comprises:

(1) Download the Microsoft Research Video Description Corpus from the Internet. This data set comprises 1970 video clips collected from YouTube and can be expressed as $VID = \{Video_1, \ldots, Video_{N_d}\}$, where $N_d$ is the total number of videos in the collection VID.

(2) Each video has multiple corresponding descriptions; the sentence descriptions of a video are $Sentences = \{Sentence_1, \ldots, Sentence_N\}$, where N is the number of descriptions ($Sentence_1, \ldots, Sentence_N$) corresponding to the video.

(3) Compose <video, description> pairs from the existing video collection VID and the sentence descriptions Sentences corresponding to each video, constituting the text description training set.

202: train a convolutional neural network (CNN) model on an existing image data set according to an image classification task, training the CNN model parameters.

This step specifically comprises:

(1) Construct the AlexNet [1] CNN model shown in Fig. 2: the model comprises eight network layers, of which the first five are convolutional layers and the last three are fully connected layers.

(2) Use ImageNet as the training set, resampling every picture in the image data set to 256×256, with $IMAGE = \{Image_1, \ldots, Image_{N_m}\}$ as input, where $N_m$ is the number of pictures. With the network layers configured as in Fig. 2, the first layer can be expressed as

$$F_1(\mathrm{IMAGE}) = \mathrm{norm}\{\mathrm{pool}[\max(0,\, W_1 * \mathrm{IMAGE} + B_1)]\} \qquad (1)$$

where IMAGE denotes the input image; $W_1$ the convolution kernel parameters; $B_1$ the bias; $F_1(\mathrm{IMAGE})$ the output after the first network layer; and norm the normalization operation. In this layer the convolved image is processed by the rectified linear function $\max(0, x)$ with $x = W_1 * \mathrm{IMAGE} + B_1$, followed by the pooling operation and local response normalization (LRN), where the normalization is

$$b_i^{x,y} = a_i^{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(M-1,\, i+n/2)} \big( a_j^{x,y} \big)^2 \Big)^{\beta} \qquad (2)$$

where M is the number of feature maps after pooling; i indexes the i-th of the M feature maps; n is the local normalization size, i.e. normalization is performed over every n feature maps; $a_i^{x,y}$ is the value at coordinate (x, y) in the i-th feature map; k is a bias; α and β are normalization parameters; and $b_i^{x,y}$ is the output after local response normalization (LRN).

In AlexNet, k = 2, n = 5, α = 10⁻⁴, and β = 0.75.
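As a concrete illustration of eq. (2), the following NumPy sketch applies LRN across feature maps with the AlexNet constants; the (M, H, W) array layout is an assumption made here for clarity.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Eq. (2): normalize each map a_i by a sum over up to n neighboring maps.
    `a` has shape (M, H, W): M feature maps of spatial size H x W."""
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        lo, hi = max(0, i - n // 2), min(M - 1, i + n // 2)       # inclusive window bounds
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```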

Continuing with this model, $F_1(\mathrm{IMAGE})$ is taken as the input of the second network layer, which can be expressed as

$$F_2(\mathrm{IMAGE}) = \max(0,\, W_2 * F_1(\mathrm{IMAGE}) + B_2) \qquad (3)$$

where $W_2$ denotes the convolution kernel parameters; $B_2$ the bias; and $F_2(\mathrm{IMAGE})$ the output after the second network layer. The second layer is configured like the first; only the sizes of the convolution and pooling kernels change.

Following the AlexNet network settings, the remaining convolutional layers can be expressed in turn as

$$F_3(\mathrm{IMAGE}) = \max(0,\, W_3 * F_2(\mathrm{IMAGE}) + B_3) \qquad (4)$$

$$F_4(\mathrm{IMAGE}) = \max(0,\, W_4 * F_3(\mathrm{IMAGE}) + B_4) \qquad (5)$$

$$F_5(\mathrm{IMAGE}) = \mathrm{pool}[\max(0,\, W_5 * F_4(\mathrm{IMAGE}) + B_5)] \qquad (6)$$

where $W_3, W_4, W_5$ and $B_3, B_4, B_5$ are the convolution parameters and biases of the respective layers.

The last three layers are fully connected and, per the network settings of Fig. 2, can be expressed in turn as

$$F_6(\mathrm{IMAGE}) = \mathrm{fc}[F_5(\mathrm{IMAGE}),\, \theta_1] \qquad (7)$$

$$F_7(\mathrm{IMAGE}) = \mathrm{fc}[F_6(\mathrm{IMAGE}),\, \theta_2] \qquad (8)$$

$$F_8(\mathrm{IMAGE}) = \mathrm{fc}[F_7(\mathrm{IMAGE}),\, \theta_3] \qquad (9)$$

where fc denotes a fully connected layer and $\theta_1, \theta_2, \theta_3$ are the parameters of the three fully connected layers; the last-layer features $F_8(\mathrm{IMAGE})$ are fed to a 1000-class multinomial classifier for classification.
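The eight layers of eqs. (1)-(9) can be sketched in PyTorch as below. The kernel sizes, strides, and pooling settings follow Embodiment 3; the channel widths (96/256/384/384/256), the paddings, and the 4096-unit fully connected layers are taken from the AlexNet paper [1] and are assumptions here, not values stated in the patent.

```python
import torch.nn as nn

alexnet = nn.Sequential(
    # layers 1-2: conv + ReLU + pool + LRN, eqs. (1) and (3)
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    # layers 3-5: conv + ReLU, eqs. (4)-(6), with pooling after layer 5
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # layers 6-8: fully connected, eqs. (7)-(9)
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),  # F8, fed to the 1000-way softmax classifier
)
```

A forward pass on one resampled picture, e.g. alexnet(torch.randn(1, 3, 256, 256)), yields the 1000-way scores $F_8(\mathrm{IMAGE})$ that the classifier of step (3) below consumes.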

(3) Given the current network, set up the multinomial classifier, whose formula can be expressed as

$$l(\Theta) = \sum_{t=1}^{m} \log p\big(y^{(t)} \mid x^{(t)};\, \Theta\big) \qquad (10)$$

where $l(\Theta)$ is the objective function; m is the number of image categories in ImageNet; $x^{(t)}$ is the CNN feature extracted for each category by the AlexNet network; $y^{(t)}$ is the label of each image; and $\Theta = \{W_p, B_p, \theta_q\}$, $p = 1, \ldots, 5$, $q = 1, 2, 3$, are the parameters of the respective network layers. The parameters of the objective function are optimized by gradient descent, yielding the parameters Θ of the AlexNet network.
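A minimal training-step sketch for eq. (10), continuing the `alexnet` module of the sketch above: PyTorch's CrossEntropyLoss is the negative of the per-batch log-likelihood $l(\Theta)$ (up to averaging), so minimizing it by (stochastic) gradient descent maximizes the objective. The learning-rate and momentum values are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

optimizer = optim.SGD(alexnet.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()  # negative log-likelihood of eq. (10), averaged over the batch

def train_step(images, labels):
    """One gradient-descent update of Θ on a batch of ImageNet samples:
    `images` is a (B, 3, 256, 256) tensor, `labels` a (B,) tensor of class ids."""
    optimizer.zero_grad()
    loss = criterion(alexnet(images), labels)
    loss.backward()   # gradients of the objective w.r.t. all layer parameters
    optimizer.step()
    return loss.item()
```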

203: extract a video frame sequence from each video and extract CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network (RNN) model, and train to obtain the RNN model.

This step specifically comprises:

(1) Using the training data of step 201 and the parameters of the CNN model trained in step 202, extract the CNN feature I of each image and model it together with the image's sentence description S; the objective function is

$$\theta^{*} = \arg\max_{\theta} \sum \log p(S \mid I;\, \theta) \qquad (11)$$

where (S, I) denotes an image-text pair in the training data; θ denotes the model parameters to be optimized; and θ* the optimized parameters.

The goal of training is to maximize, over all samples, the sum of the log-probabilities of the sentences generated given the observed input image I. The probability p(S|I; θ) is computed with the chain rule of conditional probability:

$$\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, S_1, \ldots, S_{t-1}) \qquad (12)$$

where $S_0, S_1, \ldots, S_{t-1}, S_t$ denote the words of the sentence. The unknown quantity $p(S_t \mid I, S_0, S_1, \ldots, S_{t-1})$ in the formula is modeled with a recurrent neural network.
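The chain rule of eq. (12) amounts to summing per-word log-probabilities; a sketch follows, where `next_word_probs(I, prefix)` is a hypothetical callable wrapping the recurrent model of step (2) below.

```python
import numpy as np

def sentence_log_prob(next_word_probs, image_feature, words):
    """Eq. (12): log p(S|I) = sum_t log p(S_t | I, S_0, ..., S_{t-1})."""
    total = 0.0
    for t in range(len(words)):
        p = next_word_probs(image_feature, words[:t])  # distribution over the vocabulary
        total += np.log(p[words[t]])                   # log-probability of the t-th word
    return total
```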

(2) Construct the recurrent neural network (RNN):

Conditioned on the first t−1 words, the network represents these words as a fixed-length hidden state $h_t$; when a new input $x_t$ arrives, the hidden state is updated by a nonlinear function f:

$$h_{t+1} = f(h_t,\, x_t) \qquad (13)$$

where $h_{t+1}$ denotes the next hidden state.

(3) Model the nonlinear function f by constructing the long short-term memory network (LSTM) shown in Fig. 3,

where $i_t$ is the input gate, $f_t$ the forget gate, $o_t$ the output gate, and c the cell; the state updates and the output can be expressed as

$$i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1}) \qquad (14)$$

$$f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \qquad (15)$$

$$o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1}) \qquad (16)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \qquad (17)$$

$$m_t = o_t \odot c_t \qquad (18)$$

$$p_{t+1} = \mathrm{Softmax}(m_t) \qquad (19)$$

where ⊙ denotes the element-wise product between gate values; the matrices $W = \{W_{ix}; W_{im}; W_{fx}; W_{fm}; W_{ox}; W_{om}; W_{cx}; W_{cm}\}$ are the parameters to be trained; σ(·) is the sigmoid function (e.g. $\sigma(W_{ix} x_t + W_{im} m_{t-1})$ and $\sigma(W_{fx} x_t + W_{fm} m_{t-1})$ are sigmoid functions); h(·) is the hyperbolic tangent function (e.g. $h(W_{cx} x_t + W_{cm} m_{t-1})$); $p_{t+1}$ is the probability distribution over the next word after Softmax classification; and $m_t$ is the current state feature.
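Eqs. (14)-(19) translate directly into the following NumPy sketch of one LSTM step. The dictionary of weight matrices mirrors W above; the softmax projection matrix is an added assumption (the patent folds it into the Softmax step), and biases are omitted as in the formulation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM step per eqs. (14)-(19); W maps names to weight matrices."""
    i = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)                   # (14) input gate
    f = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)                   # (15) forget gate
    o = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)                   # (16) output gate
    c = f * c_prev + i * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)  # (17) cell update
    m = o * c                                                       # (18) state output
    logits = W["softmax"] @ m                                       # projection to the vocabulary
    e = np.exp(logits - logits.max())
    return m, c, e / e.sum()                                        # (19) p_{t+1}
```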

(4) Optimize the objective function (11) by gradient descent, obtaining the trained LSTM parameters W.

204: describe the video frame sequence of the video to be described with the trained RNN model to obtain the description sequence. The prediction steps are as follows:

(1) Extract the test set $VID_t = \{Video_t^1, \ldots, Video_t^{N_t}\}$, where $N_t$ is the number of test videos and the subscript t marks the test set, and extract 10 frames from each video, expressed as $Image_t = \{Image_t^1, \ldots, Image_t^{10}\}$.

(2) With the trained model parameters $\Theta = \{W_i, B_i, \theta_j\}$, $i = 1, \ldots, 5$, $j = 1, 2, 3$, use the CNN model to extract the CNN feature of each image in $Image_t$, obtaining the image features $I_t = \{I_t^1, \ldots, I_t^{10}\}$.

(3) Take the image features $I_t$ as input and evaluate formula (12) with the trained model parameters W, obtaining the sentence descriptions $S = \{S_1, \ldots, S_n\}$ and hence the sentence descriptions corresponding to the video.
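One plausible decoding loop for step (3) emits the most probable word at each step, greedily. `lstm_step` is the sketch above; the embedding table `embed`, the word list `vocab`, and the `<END>` token are assumptions about details the patent leaves unspecified.

```python
import numpy as np

def greedy_describe(feature, W, embed, vocab, max_len=20):
    """Generate one sentence from a frame's CNN feature with the trained LSTM."""
    dim = W["im"].shape[1]
    m, c = np.zeros(dim), np.zeros(dim)
    m, c, p = lstm_step(feature, m, c, W)        # the image feature is the first input
    words = []
    for _ in range(max_len):
        w = int(np.argmax(p))                    # most probable next word
        if vocab[w] == "<END>":
            break
        words.append(vocab[w])
        m, c, p = lstm_step(embed[w], m, c, W)   # feed the chosen word back in
    return " ".join(words)
```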

205: rank the plausibility of the description sequences with LexRank and select the most plausible description as the final description of the video.

(1) Run the RNN model on the video feature sequence $I_t = \{I_t^1, \ldots, I_t^{10}\}$ to generate the corresponding sentence set $S = \{S_1, \ldots, S_i, \ldots, S_n\}$.

(2) Generate sentence features. Scan, in order, all the words of every sentence $S_i$ in the sentence set S, keeping one copy of each distinct word, to form the vocabulary $VOL = \{w_1, \ldots, w_{N_w}\}$ represented as a word list, where $N_w$ is the total number of words in the vocabulary VOL. For each word $w_i$ in VOL, scan every sentence $S_j$ in S in order, count the number $n_{ij}$ of occurrences of $w_i$ in $S_j$, where $j = 1, \ldots, N_s$ and $N_s$ is the total number of sentences, and count the number $num(w_i)$ of sentences in S containing $w_i$. Then compute the term frequency $tf(w_i, s_j)$ of each word $w_i$ in each sentence $S_j$ by formula (20), for $i = 1, \ldots, N_w$ and $j = 1, \ldots, N_s$:

$$tf(w_i, s_j) = n_{ij} \Big/ \sum_{k=1}^{N_w} n_{kj} \qquad (20)$$

where $n_{kj}$ is the number of occurrences of the k-th word in the j-th sentence.

For each word $w_i$ in the vocabulary VOL, compute its inverse document frequency $idf(w_i)$ by formula (21):

$$idf(w_i) = \log\big(N_d / num(w_i)\big) \qquad (21)$$

where $N_d$ is the total number of sentences (treated as documents) in the set.

According to the vector space model, each sentence $S_j$ in the set S is represented as an $N_w$-dimensional vector whose i-th dimension corresponds to the word $w_i$ of the vocabulary, with value $tfidf(w_i)$ computed as

$$tfidf(w_i) = tf(w_i, s_j) \times idf(w_i) \qquad (22)$$
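Eqs. (20)-(22) correspond to the following sketch, which represents each sentence (given as a list of words) as a sparse tf-idf vector; treating each sentence as a "document" for eq. (21) is the reading adopted here.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return one {word: tf*idf} vector per sentence, plus the idf table."""
    n_sent = len(sentences)
    df = Counter(w for s in sentences for w in set(s))   # num(w_i): sentences containing w_i
    idf = {w: math.log(n_sent / df[w]) for w in df}      # eq. (21)
    vectors = []
    for s in sentences:
        counts, total = Counter(s), len(s)
        vectors.append({w: (c / total) * idf[w] for w, c in counts.items()})  # eqs. (20), (22)
    return vectors, idf
```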

(3) Take the cosine between two sentence vectors $S_i$ and $S_j$ as the sentence similarity, computed as

$$similarity(S_i, S_j) = \frac{\sum_{w \in S_i, S_j} tf_{w,S_i}\, tf_{w,S_j}\, (idf_w)^2}{\sqrt{\sum_{s_m \in S_i} \big( tf_{s_m,S_i}\, idf_{s_m} \big)^2} \times \sqrt{\sum_{s_n \in S_j} \big( tf_{s_n,S_j}\, idf_{s_n} \big)^2}} \qquad (23)$$

where $tf_{w,S_i}$ is the frequency of word w in sentence $S_i$; $tf_{w,S_j}$ is the frequency of word w in sentence $S_j$; $idf_w$ is the inverse document frequency of word w; $s_m$ is any word in sentence $S_i$; $tf_{s_m,S_i}$ is the frequency of $s_m$ in $S_i$; $idf_{s_m}$ is the inverse document frequency of $s_m$; $s_n$ is any word in sentence $S_j$; $tf_{s_n,S_j}$ is the frequency of $s_n$ in $S_j$; and $idf_{s_n}$ is the inverse document frequency of $s_n$.

A fully connected undirected graph is then formed, as in Fig. 4(a), with each node $u_i$ standing for sentence $S_i$ and the edges between nodes weighted by sentence similarity (a sketch of this similarity follows).
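Eq. (23) is the idf-modified cosine. A sketch, using raw within-sentence counts for tf as the formula does; the full similarity matrix of the graph in Fig. 4(a) is obtained by evaluating it over all sentence pairs, with `idf` coming from the tf-idf sketch above.

```python
import math
from collections import Counter

def idf_cosine(s_i, s_j, idf):
    """Eq. (23): idf-modified cosine similarity of two word lists."""
    tf_i, tf_j = Counter(s_i), Counter(s_j)
    num = sum(tf_i[w] * tf_j[w] * idf.get(w, 0.0) ** 2 for w in tf_i.keys() & tf_j.keys())
    den_i = math.sqrt(sum((tf_i[w] * idf.get(w, 0.0)) ** 2 for w in tf_i))
    den_j = math.sqrt(sum((tf_j[w] * idf.get(w, 0.0)) ** 2 for w in tf_j))
    return num / (den_i * den_j) if den_i and den_j else 0.0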

(4) Set a threshold Degree and delete all edges whose similarity is less than Degree, as in Fig. 4(b).

(5) Compute the LexRank score LR of each sentence node $u_i$. The initial score of every sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, usually chosen in [0.1, 0.2]. The score LR is computed by formula (24):

$$LR(u) = \frac{d}{N} + (1 - d) \sum_{v \in adj[u]} \frac{LR(v)}{\deg(v)} \qquad (24)$$

where $\deg(v)$ is the degree of node v; LR(u) is the score of node u; and LR(v) is the score of node v.

(6) Compute and rank the LR score of every sentence node, and select the sentence with the highest score as the final description of the video.
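Steps (4)-(6) amount to pruning the similarity graph and iterating eq. (24) to a fixed point. A sketch, where the threshold value and the iteration count are assumptions (the patent fixes neither):

```python
def lexrank_scores(sim, threshold=0.1, d=0.15, iters=100):
    """Iterate LR(u) = d/N + (1-d) * sum_{v in adj[u]} LR(v)/deg(v), eq. (24).
    `sim` is the N x N sentence similarity matrix from eq. (23)."""
    N = len(sim)
    adj = [[j for j in range(N) if j != i and sim[i][j] >= threshold] for i in range(N)]
    deg = [max(len(a), 1) for a in adj]   # guard against isolated nodes
    lr = [d / N] * N                      # initial score of every node
    for _ in range(iters):
        lr = [d / N + (1 - d) * sum(lr[v] / deg[v] for v in adj[u]) for u in range(N)]
    return lr
```

The sentence whose node has the highest score in `lr` becomes the final description of the video, per step (6).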

In summary, through steps 201 to 205 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, thereby describing and summarizing the video content.

Embodiment 3

Here two videos are selected as the videos to be described, as shown in Fig. 5, and the method of the present invention based on deep learning and text summarization is used to predict and output the corresponding video descriptions:

(1) Use ImageNet as the training set, resampling every picture in the data set to 256×256, with $IMAGE = \{Image_1, \ldots, Image_{N_m}\}$ as input, where $N_m$ is the number of pictures.

(2) Build the first convolutional layer: set the convolution kernel cov1 to size 11 with stride 4; choose ReLU, max(0, x); apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2; and normalize the convolved data with local response normalization. In AlexNet, k = 2, n = 5, α = 10⁻⁴, β = 0.75.

(3) Build the second convolutional layer: set the convolution kernel cov2 to size 5 with stride 1; choose ReLU, max(0, x); apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2; and normalize the convolved data with local response normalization.

(4) Build the third convolutional layer: set the convolution kernel cov3 to size 3 with stride 1, and choose ReLU, max(0, x).

(5) Build the fourth convolutional layer: set the convolution kernel cov4 to size 3 with stride 1, and choose ReLU, max(0, x).

(6) Build the fifth convolutional layer: set the convolution kernel cov5 to size 3 with stride 1, choose ReLU, max(0, x), and apply the pooling operation to the convolved feature maps with kernel size 3 and stride 2.

(7) Build the sixth layer, the fully connected layer fc6; choose ReLU, max(0, x), and apply dropout to the processed data.

(8) Build the seventh layer, the fully connected layer fc7; choose ReLU, max(0, x), and apply dropout to the processed data.

(9) Build the eighth layer, the fully connected layer fc8, and add a Softmax classifier as the objective function.

(10) With the eight network layers above configured, the convolutional neural network (CNN) model is established.

(11) Train the CNN model parameters.

(12) Data processing: extract 10 frames uniformly from each video in the data set and resample them to 256×256. Feed the images into the trained CNN model to obtain image features; each frame is randomly paired with 5 of the video's descriptive sentences to form image-text pairs.

(13) Construct the recurrent neural network (RNN) model.

Fig. 5 shows the video text description results produced by the present invention. The pictures in the figure are video frames extracted from the videos, and the sentence accompanying each frame is the result obtained after the video features pass through the language model. The lower part of the figure shows, after summarization, the sentences generated using only the video features and image transfer, together with the videos' original descriptions.

In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through a convolutional neural network and a recurrent neural network, and selects high-quality, representative sentences from the many candidates through text summarization. With this method users can obtain video descriptions of high accuracy, and the method can be extended to video retrieval.

References

[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]. Advances in Neural Information Processing Systems, 2012.

Those skilled in the art will understand that the accompanying drawings are only schematic diagrams of a preferred embodiment, and that the serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (4)

1. A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:

downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;

training a convolutional neural network model on an existing image data set according to an image classification task;

extracting a video frame sequence from each video and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs that serve as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;

describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences;

ranking the description sequences by graph-based lexical centrality as salience in text summarization, and outputting the final description of the video.

2. The video description method based on deep learning and text summarization according to claim 1, characterized in that downloading videos from the Internet and describing each video to form <video, description> pairs constituting the text description training set specifically comprises:

composing <video, description> pairs from an existing video collection and the sentence descriptions corresponding to each video, which constitute the text description training set.

3. The video description method based on deep learning and text summarization according to claim 1, characterized in that the step of extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, and forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model to train it specifically comprises:

extracting the convolutional neural network features of the images with the parameters of the trained convolutional neural network model, and modeling them together with the sentence descriptions of the images to obtain an objective function;

constructing a recurrent neural network, in which the nonlinear function is modeled by a long short-term memory network;

optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.

4. The video description method based on deep learning and text summarization according to claim 1, characterized in that the step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:

extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;

taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.
CN201510697454.3A 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization Active CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Publications (2)

Publication Number Publication Date
CN105279495A true CN105279495A (en) 2016-01-27
CN105279495B CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A Active CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Cited By (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Statement recognition methods based on two-way LSTM neutral net and device
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 A kind of video and the determination method and device of Video Key word degree of association
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Generation method and device of natural language sentence library
CN106485251A (en) * 2016-10-08 2017-03-08 天津工业大学 Egg embryo classification based on deep learning
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image text description method based on knowledge transfer multi-modal recurrent neural network
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 A kind of video presentation generation method based on deep learning and probability graph model
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 A kind of method and system for realizing image switch labels
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 A kind of data automatic statistical analysis method
CN107292086A (en) * 2016-04-07 2017-10-24 西门子保健有限责任公司 Graphical analysis question and answer
CN107368887A (en) * 2017-07-25 2017-11-21 江西理工大学 A kind of structure and its construction method of profound memory convolutional neural networks
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 A kind of image processing method and system
CN107515900A (en) * 2017-07-24 2017-12-26 宗晖(上海)机器人有限公司 Intelligent robot and its event memorandum system and method
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 A Image Caption Method Based on Attribute Probability Vector Guided Attention Patterns
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Generated according to video data and explain data, data synthesis method and device, electronic equipment
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Forecasting Methodology, the device and system of destination object attribute
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model
CN107844751A (en) * 2017-10-19 2018-03-27 陕西师范大学 The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108307229A (en) * 2018-02-02 2018-07-20 新华智云科技有限公司 A kind of processing method and equipment of video-audio data
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 A kind of figure says generation method and device
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction technique and device, storage medium
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Official documents and correspondence generation method and device, storage medium and electronic device
CN109711022A (en) * 2018-12-17 2019-05-03 哈尔滨工程大学 A submarine anti-sinking system based on deep learning
CN109891897A (en) * 2016-10-27 2019-06-14 诺基亚技术有限公司 Method for analyzing media content
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A Method for Generating Image Semantic Description
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN110612537A (en) * 2017-05-02 2019-12-24 柯达阿拉里斯股份有限公司 System and method for batch normalized loop highway network
CN110678816A (en) * 2017-04-04 2020-01-10 西门子股份公司 Method and control device for controlling a technical system
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 A video object localization method based on weakly supervised learning and video spatiotemporal features
CN110781345A (en) * 2019-10-31 2020-02-11 北京达佳互联信息技术有限公司 Video description generation model acquisition method, video description generation method and device
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN111404676A (en) * 2020-03-02 2020-07-10 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secure and secret key and cipher text
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 A video annotation method based on deep learning
CN111461974A (en) * 2020-02-17 2020-07-28 天津大学 Image scanning path control method based on L STM model from coarse to fine
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system and device for identifying and constructing audio frequency by video object and readable storage medium
WO2020220702A1 (en) * 2019-04-29 2020-11-05 北京三快在线科技有限公司 Generation of natural language
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
WO2021056750A1 (en) * 2019-09-29 2021-04-01 北京市商汤科技开发有限公司 Search method and device, and storage medium
CN113191262A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN119011953A (en) * 2024-09-14 2024-11-22 广州九微信息科技有限公司 Video-on-demand and audio-frequency service system and method based on cloud computing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUNES ERKAN: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research *
SUBHASHINI VENUGOPALAN et al.: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Computer Science *

Cited By (107)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437929B2 (en) 2016-03-31 2019-10-08 Maluuba Inc. Method and system for processing an input query using a forward and a backward neural network specific to unigrams
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
CN107292086A (en) * 2016-04-07 2017-10-24 西门子保健有限责任公司 Image analysis question answering
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 Image processing method and system
CN107391505B (en) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 Image processing method and system
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Sentence recognition method and device based on bidirectional LSTM neural network
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition method and device based on bidirectional LSTM neural network
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 Method and device for determining the correlation between a video and video keywords
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 Method and device for determining the correlation between a video and video keywords
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Method and device for generating commentary data from video data, data synthesis method and device, and electronic device
CN106372107B (en) * 2016-08-19 2020-01-17 中兴通讯股份有限公司 Method and device for generating natural language sentence library
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Method and device for generating a natural language sentence library
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 Method for generating image descriptions from structured text
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 Method for generating image descriptions from structured text
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Classification of egg embryos based on deep learning
CN106485251A (en) * 2016-10-08 2017-03-08 天津工业大学 Egg embryo classification based on deep learning
CN109891897B (en) * 2016-10-27 2021-11-05 诺基亚技术有限公司 Method for analyzing media content
US11068722B2 (en) 2016-10-27 2021-07-20 Nokia Technologies Oy Method for analysing media content to generate reconstructed media content
CN109891897A (en) * 2016-10-27 2019-06-14 诺基亚技术有限公司 Method for analyzing media content
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on depth LSTM network
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech emotion recognition method based on long short-term memory networks and convolutional neural networks
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 An image description method based on multi-level connection recurrent neural network
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method based on multi-level connected recurrent neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image-text description method based on knowledge-transfer multimodal recurrent neural network
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image-text description method based on knowledge-transfer multimodal recurrent neural network
CN106845411B (en) * 2017-01-19 2020-06-30 清华大学 Video description generation method based on deep learning and probabilistic graphical model
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 Video description generation method based on deep learning and probabilistic graphical model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 Video description method based on bidirectional fractal network and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 Video fingerprinting algorithm based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 Video content description method guided by semantic information
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Image caption generation method and device
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 Image caption generation method and device
US10983485B2 (en) 2017-04-04 2021-04-20 Siemens Aktiengesellschaft Method and control device for controlling a technical system
CN110678816A (en) * 2017-04-04 2020-01-10 西门子股份公司 Method and control device for controlling a technical system
CN110678816B (en) * 2017-04-04 2021-02-19 西门子股份公司 Method and control device for controlling a technical system
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction method and device, and storage medium
CN110612537A (en) * 2017-05-02 2019-12-24 柯达阿拉里斯股份有限公司 System and method for batch-normalized recurrent highway networks
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 Method and system for converting images into labels
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 Method, apparatus and electronic device for implementing image-text matching
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 Automatic statistical data analysis method
CN107515900A (en) * 2017-07-24 2017-12-26 宗晖(上海)机器人有限公司 Intelligent robot and event memo system and method thereof
CN107368887A (en) * 2017-07-25 2017-11-21 江西理工大学 Structure of a deep-memory convolutional neural network and construction method thereof
CN107368887B (en) * 2017-07-25 2020-08-07 江西理工大学 A device for deep memory convolutional neural network and its construction method
US11481625B2 (en) 2017-08-04 2022-10-25 Nokia Technologies Oy Artificial neural network
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 Image captioning method based on attribute-probability-vector-guided attention patterns
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 Human close-proximity action recognition method and device, storage medium, and electronic device
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN109522531B (en) * 2017-09-18 2023-04-07 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video description method, system and device
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN107844751A (en) * 2017-10-19 2018-03-27 陕西师范大学 Classification method for hyperspectral remote sensing images based on guided filtering and long short-term memory neural network
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Classification method for hyperspectral remote sensing images based on guided filtering and long short-term memory neural network
CN107818306B (en) * 2017-10-31 2020-08-07 天津大学 Video question-answering method based on attention model
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 Video question answering method based on attention model
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamic multimodal video description generation method
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 A dynamic multimodal video description generation method
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 Code annotation classification method based on neural network model
CN108307229B (en) * 2018-02-02 2023-12-22 新华智云科技有限公司 Video and audio data processing method and device
CN108307229A (en) * 2018-02-02 2018-07-20 新华智云科技有限公司 Video and audio data processing method and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN108765383B (en) * 2018-03-22 2022-03-18 山西大学 Video description method based on deep transfer learning
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video description method based on deep transfer learning
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 Chinese image semantic description method based on residual-connected Inception network fused with multi-layer GRU
CN108881950B (en) * 2018-05-30 2021-05-25 北京奇艺世纪科技有限公司 Video processing method and device
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 Video processing method and apparatus
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 Video processing method and apparatus
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Duplicate video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Duplicate video detection method and device
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109711022A (en) * 2018-12-17 2019-05-03 哈尔滨工程大学 A submarine anti-sinking system based on deep learning
CN109960747B (en) * 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 Video description information generation method, video processing method, and corresponding devices
US11861886B2 (en) 2019-04-02 2024-01-02 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating video description information, and method and apparatus for video processing
WO2020220702A1 (en) * 2019-04-29 2020-11-05 北京三快在线科技有限公司 Generation of natural language
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 An adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 Image semantic description generation method
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 Adaptive generation system for image semantic description
WO2021056750A1 (en) * 2019-09-29 2021-04-01 北京市商汤科技开发有限公司 Search method and device, and storage medium
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 A video object localization method based on weakly supervised learning and video spatiotemporal features
CN110765921B (en) * 2019-10-18 2022-04-19 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN110781345A (en) * 2019-10-31 2020-02-11 北京达佳互联信息技术有限公司 Video description generation model acquisition method, video description generation method and device
CN111461974A (en) * 2020-02-17 2020-07-28 天津大学 Image scanning path control method based on coarse-to-fine LSTM model
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on coarse-to-fine LSTM model
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 A video annotation method based on deep learning
CN111404676B (en) * 2020-03-02 2023-08-29 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret key and ciphertext
CN111404676A (en) * 2020-03-02 2020-07-10 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret keys and ciphertext
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph convolutional network
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for identifying video objects and constructing audio
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for identifying video objects and constructing audio
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN111931690B (en) * 2020-08-28 2024-08-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN113191262A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting text into video
CN119011953A (en) * 2024-09-14 2024-11-22 广州九微信息科技有限公司 Video and audio on-demand service system and method based on cloud computing

Also Published As

Publication number Publication date
CN105279495B (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN105279495B (en) A video description method based on deep learning and text summarization
CN106503055B (en) Method for generating image descriptions from structured text
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN109543084B (en) Method for building a hidden sensitive text detection model for online social media
CN110019839B (en) Method and system for constructing medical knowledge graph based on neural network and remote supervision
CN112270196B (en) Entity relationship identification method and device and electronic equipment
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
CN106682192B (en) A method and device for training an answer intent classification model based on search keywords
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN104142995B (en) Social event recognition method based on visual attributes
CN106126619A (en) Video retrieval method and system based on video content
CN111814477B (en) Dispute focus discovery method, device and terminal based on dispute focus entity
CN108280057A (en) Microblog rumor detection method based on BLSTM
CN114661872B (en) A beginner-oriented API adaptive recommendation method and system
CN110377778A (en) Image ranking method, device and electronic device based on title-image correlation
CN114818724B (en) A method for constructing an effective disaster information detection model on social media
CN113343690A (en) Text readability automatic evaluation method and device
CN117033558A (en) Movie review sentiment analysis method fusing BERT-WWM and multiple features
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN105975497A (en) Automatic microblog topic recommendation method and device
CN110046353A (en) An Aspect-Level Sentiment Analysis Method Based on Multilingual Hierarchical Mechanism
CN114579741A (en) GCN-RN aspect-level sentiment analysis method and system incorporating syntactic information
CN110110137A (en) Method and device for determining music characteristics, electronic equipment and storage medium
CN106599824A (en) GIF cartoon emotion identification method based on emotion pairs
CN115775349A (en) False news detection method and device based on multi-mode fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220322

Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong

Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD.

Address before: 300072 No. 92, Weijin Road, Nankai District, Tianjin

Patentee before: Tianjin University