CN105279495B - A kind of video presentation method summarized based on deep learning and text - Google Patents

A kind of video presentation method summarized based on deep learning and text

Info

Publication number
CN105279495B
Authority
CN
China
Prior art keywords
video
description
sentence
neural network
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495A (en)
Inventor
李广
马书博
韩亚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wellthinker Automation Technology Co ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video description method based on deep learning and text summarization, comprising: training a convolutional neural network model on an existing image data set according to an image classification task; extracting a video frame sequence from the video and extracting convolutional neural network features with the trained model; forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model and training to obtain the recurrent neural network model; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain a description sequence; and ranking the description sequence with a method that uses graph-based lexical centrality as salience for text summarization, outputting the final description result of the video. Events occurring in a video segment and the object attributes related to those events are described in natural language, so as to describe and summarize the video content.

Description

Video description method based on deep learning and text summarization
Technical Field
The invention relates to the field of video description, in particular to a video description method based on deep learning and text summarization.
Background
Describing a video in natural language is extremely important both for understanding the video and for retrieving it on the Web. Meanwhile, the language description of video is the subject of intensive research in the fields of multimedia and computer vision. Video description means that, for a given video, video features are obtained by observing the contents of the video, and corresponding sentences are generated according to those contents. When people watch a video, especially a video of some action category, they understand it to a certain extent and can state in language what happens in it. For example, the video can be described with the sentence "a person is riding a motorcycle". However, when the number of videos is large, describing them one by one manually requires a great deal of time, labor and money. It is therefore necessary to analyze video features with computer technology and, combined with natural language processing methods, generate descriptions of the video. On one hand, through such a video description method, people can understand a video more accurately from a semantic perspective. On the other hand, in the field of video retrieval, retrieving the corresponding video from a text description entered by the user is a very difficult and challenging problem.
Various video description methods have emerged over the past few years. For example, by analyzing video features, the objects present in the video and the action relationships among them can be identified. A fixed language template of the form subject + verb + object is then adopted: the subject and object are determined from the recognized objects, the action relationship between them serves as the predicate, and the sentence describing the video is generated in this way.
However, such methods have certain limitations. Generating sentences with language templates easily leads to fixed sentence patterns; the sentences are too uniform and lack the expressiveness of natural human language. Meanwhile, different features are needed to identify objects, actions and so on in the video, so the steps are relatively cumbersome and a large amount of time is needed to train on the video features. Moreover, the recognition accuracy directly affects the quality of the generated sentences, and such a step-by-step method must guarantee high correctness at every step, which is difficult to achieve.
Disclosure of Invention
The invention provides a video description method based on deep learning and text summarization, which describes, in natural language, the events occurring in a video segment and the object attributes related to those events, thereby describing and summarizing the video content. The details are described as follows:
a video description method based on deep learning and text summarization is characterized by comprising the following steps:
downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
training a convolutional neural network model according to an image classification task through an existing image data set;
extracting a video frame sequence from the video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
describing a video frame sequence of a video to be described through a trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence with a method that takes the graph-based lexical centrality as the salience of the text summary, and outputting the final description result of the video.
The downloading of videos from the internet and the description of each video form a < video, description > pair, and the forming of a text description training set specifically comprises:
forming <video, description> pairs from the existing video set and the sentence descriptions corresponding to each video, to form the text description training set.
The steps of extracting a video frame sequence from the video, extracting convolutional neural network features with a convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model specifically comprise:
extracting the convolutional neural network characteristics of the image and sentence descriptions corresponding to the image for modeling by using the parameters after the convolutional neural network model is trained to obtain a target function;
constructing a recurrent neural network; modeling the nonlinear function with a long short-term memory (LSTM) network;
and optimizing the objective function by gradient descent, obtaining the trained long short-term memory network parameters.
The step of describing the video frame sequence of the video to be described by the trained recurrent neural network model to obtain the description sequence specifically comprises the following steps:
extracting the convolutional neural network characteristics of each image by using the trained model parameters and the convolutional neural network model to obtain image characteristics;
and taking the image characteristics as input and obtaining sentence description by using the model parameters obtained by training so as to obtain the sentence description corresponding to the video.
The technical scheme provided by the invention has the following beneficial effects: each video is composed of a frame sequence, and the low-level features of each frame are extracted with a convolutional neural network; extracting video features by deep learning in this way effectively avoids the excessive noise introduced by traditional feature extraction methods, noise that would reduce the accuracy of the sentences generated later. Each frame picture is converted into a sentence with the trained recurrent neural network, thereby generating a set of sentences. The automatic text summarization method then screens high-quality, representative sentences out of this set, by computing the centrality between sentences, to serve as the description of the video; this yields better description accuracy and sentence diversity. Meanwhile, the method based on deep learning and text summarization can be effectively extended to video retrieval applications, although it is at present limited to English descriptions of video content.
Drawings
FIG. 1 is a flow chart of a video description method based on deep learning and text summarization;
FIG. 2 is a schematic diagram of a convolutional neural network model (CNN) used in the present invention;
wherein Cov represents a convolution kernel; ReLU is the rectified linear unit max(0, x); Pool represents the pooling operation; LRN is the local response normalization operation; Softmax is the objective function.
FIG. 3 is a schematic diagram of a recurrent neural network used in the present invention;
wherein x_t represents the input in state t; h_(t-1) represents the hidden state of the previous state; i is the input gate; f is the forget gate; o is the output gate; c is the cell; m_t is the output after passing through an LSTM unit.
FIG. 4(a) is the initial fully connected graph of LexRank;
wherein S = {S_1, …, S_10} are 10 sentences generated by the recurrent neural network (RNN), represented as 10 nodes of a graph; the similarity between nodes is represented by edges, forming a fully connected graph, and the thickness of an edge represents the magnitude of the similarity.
FIG. 4(b) is the pruned LexRank graph;
wherein, by setting a threshold, the edges with small similarity between nodes are removed, so that the remaining edges connect nodes whose sentences are highly similar.
Fig. 5 is a schematic diagram of a sentence generated after a part of a video frame is described.
wherein each frame image is followed by the sentence generated with the CNN-RNN model used in the invention, and the sentence pointed to by the arrow is the summary of the video text descriptions obtained with the LexRank method, used as the text description of the video.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
In view of the problems described in the background art, and inspired by the remarkable improvement that deep learning methods have brought to image description, applying deep learning to video can improve the diversity and correctness of the generated video descriptions to a certain extent.
The embodiment of the invention provides a video description method based on deep learning and text summarization. Visual features are first extracted from the frames of each video with a convolutional neural network. Each frame feature is then taken as input to a recurrent neural network framework, with which a sentence description can be generated for each visual feature, i.e. for each frame of the video. A sentence set is thus obtained; in order to obtain the most expressive, high-quality sentences as the description of the video, the method adopts text summarization and ranks all sentences by computing the similarity between them, so that wrong or low-quality sentences are avoided as the final description of the video. With this automatic text summarization, a representative sentence with a certain correctness and reliability is obtained, improving the accuracy of the video description. Meanwhile, the method also overcomes some of the technical difficulties faced by video retrieval.
Example 1
A video description method based on deep learning and text summarization, referring to fig. 1, the method comprising the steps of:
101: downloading videos from the Internet, describing each video (English description) to form a < video, description > pair, and forming a text description training set, wherein each video corresponds to a plurality of sentence descriptions, so that a text description sequence is formed;
102: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set;
for example: ImageNet.
103: extracting a video frame sequence from a video, extracting CNN characteristics by using a Convolutional Neural Network (CNN) model, forming a < video frame sequence, a text description sequence > as an input of a Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
104: describing a video frame sequence of a video to be described by using the RNN model obtained by training to obtain a description sequence;
105: the reasonableness of the description sequence is ranked by using a method based on the lexical centrality of the graph as the significance (LexRank) of the text summary, and the most reasonable description is selected as the final description of the video.
In summary, the embodiments of the present invention implement, through steps 101 to 105, to describe an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 2
201: downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
the method specifically comprises the following steps:
(1) The Microsoft Research Video Description Corpus is downloaded from the Internet. This data set comprises 1970 video segments collected from YouTube and can be represented as VID = {vid_1, …, vid_Nd}, where Nd is the total number of videos in the set VID.
(2) Each video has a plurality of corresponding descriptions, written as Sentences = {Sentence_1, …, Sentence_N}, where N is the number of descriptions (Sentence_1, …, Sentence_N) corresponding to each video.
(3) The < video, description > pair is formed by the existing video set VID and the sentence description sequences corresponding to each video, and a text description training set is formed.
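For illustration only, a minimal Python sketch of this pairing step is given below; the video identifiers and sentences are placeholders, not data from the actual corpus.

```python
# Sketch of step 201: pair each video of the set VID with its sentence
# descriptions to form the <video, description> training set.
# Identifiers and sentences are illustrative placeholders.
video_ids = ["vid_0001", "vid_0002"]            # the set VID, |VID| = Nd
captions = {
    "vid_0001": ["a person is riding a motorcycle",
                 "a man rides a motorbike down the road"],
    "vid_0002": ["a dog is barking at the camera"],
}

# <video, description> pairs: each video keeps its full sentence sequence
training_set = [(vid, captions[vid]) for vid in video_ids]
for vid, sentences in training_set:
    print(vid, len(sentences), "descriptions")
```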
202: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set, and training CNN model parameters;
the method specifically comprises the following steps:
(1) constructing AlexNet [1] CNN model shown in FIG. 2: the model comprises 8 network layers, wherein the first 5 layers are convolutional layers, and the last 3 layers are full-connection layers.
(2) Using ImageNet as the training set, each picture in the image data set is sampled to a 256 × 256 picture and taken as input, with Nm the number of pictures. According to the network layers set in FIG. 2, layer 1 can be expressed as:
F1(IMAGE)=norm{pool[max(0,W1*IMAGE+B1)]} (1)
wherein IMAGE represents an input image; W1 represents the convolution kernel parameters; B1 represents a bias; F1(IMAGE) is the output after passing through the first network layer; norm denotes the normalization operation. In this network layer, the convolved image is processed by the linear rectification function max(0, x), with x = W1*IMAGE + B1, a pooling operation is performed, and the pooled result is normalized with local response normalization (LRN), in the following way:
b^i_(x,y) = a^i_(x,y) / ( k + α · Σ_j (a^j_(x,y))^2 )^β,  j = max(0, i − n/2), …, min(M − 1, i + n/2)   (2)
wherein M is the number of feature maps after pooling; i indexes the i-th of the M feature maps; n is the size of the local normalization, i.e. normalization is carried out over every n feature maps; a^i_(x,y) is the value at coordinate (x, y) in the i-th feature map; k is the offset; α and β are the normalization parameters; and b^i_(x,y) is the output after local response normalization (LRN).
In AlexNet, k = 2, n = 5, α = 10^-4, and β = 0.75.
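For illustration, a small NumPy sketch of the local response normalization of equation (2) with the AlexNet constants above; the array shape and values are toy placeholders.

```python
import numpy as np

def local_response_norm(a, k=2, n=5, alpha=1e-4, beta=0.75):
    # a: array of shape (M, H, W), the M pooled feature maps; returns b of eq. (2)
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        lo, hi = max(0, i - n // 2), min(M - 1, i + n // 2)
        b[i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
    return b

maps = np.random.rand(8, 4, 4)                  # toy "feature maps"
print(local_response_norm(maps).shape)          # (8, 4, 4)
```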
Continuing with the model, F1(IMAGE) is taken as the input of the second network layer, which, according to the settings of the second layer, can be expressed as:
F2(IMAGE)=max(0,W2*F1(IMAGE)+B2) (3)
wherein W2 represents the convolution kernel parameters; B2 represents a bias; F2(IMAGE) is the output after the second network layer. The second layer is arranged in the same way as the first, only the sizes of the convolution and pooling kernels change.
According to the network setup of AlexNet, the remaining convolutional layers can be represented in turn as:
F3(IMAGE)=max(0,W3*F2(IMAGE)+B3) (4)
F4(IMAGE)=max(0,W4*F3(IMAGE)+B4) (5)
F5(IMAGE)=pool[max(0,W5*F4(IMAGE)+B5)] (6)
wherein W3, W4, W5 and B3, B4, B5 are the convolution kernel parameters and biases of the respective layers.
The last 3 layers are fully connected layers, and the network layer settings according to fig. 2 can be expressed in turn as:
F6(IMAGE)=fc[F5(IMAGE),θ1] (7)
F7(IMAGE)=fc[F6(IMAGE),θ2] (8)
F8(IMAGE)=fc[F7(IMAGE),θ3] (9)
wherein fc represents a fully connected layer and θ1, θ2, θ3 are the parameters of the three fully connected layers; the feature F8(IMAGE) of the last layer is input to a 1000-class multivariate classifier for classification.
(3) According to the current network, a multivariate (softmax) classifier is set with objective function l(Θ), wherein m is the number of image categories in ImageNet, x(t) are the CNN features extracted for each class through the AlexNet network, y(t) is the label corresponding to each image, and Θ = {Wp, Bp, θq}, p = 1, …, 5, q = 1, 2, 3, are the parameters of each network layer. The parameters of the objective function are optimized by gradient descent, thereby obtaining the parameters Θ of the AlexNet network.
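The eight-layer network of step 202 can be sketched as follows. PyTorch is assumed as the implementation library; the channel widths (96/256/384/384/256) and fully connected sizes (4096) are the standard AlexNet values, since the patent fixes only the kernel sizes, strides, and LRN constants.

```python
import torch
import torch.nn as nn

class AlexNetCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),      # layer 1, eq. (1)
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),    # layer 2, eq. (3)
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (4)
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (5)
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (6)
            nn.MaxPool2d(3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),      # fc6, eq. (7)
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # fc7, eq. (8)
            nn.Linear(4096, num_classes),                                           # fc8, eq. (9)
        )

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

model = AlexNetCNN()
logits = model(torch.randn(2, 3, 256, 256))                   # two 256x256 crops
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 2]))    # 1000-class softmax objective, step (3)
loss.backward()                                               # parameters Θ optimized by gradient descent
```

In step 203 below, an intermediate fully connected activation of such a network would typically serve as the per-image CNN feature I.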
203: extracting a video frame sequence from a video, extracting CNN characteristics by using a Convolutional Neural Network (CNN) model, forming a < video frame sequence, a text description sequence > as an input of a Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
the method comprises the following steps:
(1) According to step 202, using the parameters obtained after training the CNN model, the CNN feature I of an image and the sentence description S corresponding to the image are extracted for modeling, with the following objective function:
θ*=argmax∑logp(S|I;θ) (11)
wherein (S, I) represents an image-text pair in the training data; θ is the model parameter to be optimized; θ* is the optimized parameter.
the training aims to maximize the sum of the logarithmic probabilities of the sentences generated by all samples under observation of a given input image I, and the conditional probability chain rule is used to calculate the probability p (SI; theta), where the expression is:
wherein S_0, S_1, …, S_(t−1), S_t represent the words in a sentence. The unknown quantity p(S_t | I, S_0, S_1, …, S_(t−1)) in the formula is modeled with a recurrent neural network.
(2) Constructing a Recurrent Neural Network (RNN):
The first t−1 words are taken as the condition and represented as a fixed-length hidden state h_t; when a new input x_t appears, the hidden state is updated through a nonlinear function f, whose expression is:
h_(t+1)=f(h_t,x_t) (13)
wherein h_(t+1) denotes the next hidden state.
(3) The nonlinear function f is modeled by constructing a long short-term memory network (LSTM), as shown in FIG. 3;
wherein i_t is the input gate, f_t the forget gate, o_t the output gate, and c the cell; the update and output of each state can be expressed as:
i_t = σ(W_ix·x_t + W_im·m_(t−1))   (14)
f_t = σ(W_fx·x_t + W_fm·m_(t−1))   (15)
o_t = σ(W_ox·x_t + W_om·m_(t−1))   (16)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ h(W_cx·x_t + W_cm·m_(t−1))   (17)
m_t = o_t ⊙ c_t   (18)
p_(t+1) = Softmax(m_t)   (19)
wherein ⊙ denotes the product between gate values; the matrices W = {W_ix; W_im; W_fx; W_fm; W_ox; W_om; W_cx; W_cm} are the parameters to be trained; σ(·) is the sigmoid function (e.g. σ(W_ix·x_t + W_im·m_(t−1)) and σ(W_fx·x_t + W_fm·m_(t−1))); h(·) is the hyperbolic tangent function (e.g. h(W_cx·x_t + W_cm·m_(t−1))); p_(t+1) is the probability distribution of the next word after the Softmax classification; m_t is the current state feature.
(4) The objective function (11) is optimized by gradient descent, obtaining the trained long short-term memory (LSTM) network parameters W.
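A NumPy sketch of one LSTM update following equations (14) to (19) is given below; the toy dimension, the random parameters, and the direct Softmax over m_t (a practical decoder would normally project m_t to the vocabulary size first) are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)                        # input gate,  eq. (14)
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)                        # forget gate, eq. (15)
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)                        # output gate, eq. (16)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)   # cell update, eq. (17)
    m_t = o_t * c_t                                                        # output,      eq. (18)
    return m_t, c_t, softmax(m_t)                                          # p_(t+1),     eq. (19)

d = 16                                        # toy size (hidden state = vocabulary here)
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ("ix", "im", "fx", "fm", "ox", "om", "cx", "cm")}
m, c = np.zeros(d), np.zeros(d)
m, c, p_next = lstm_step(rng.normal(size=d), m, c, W)   # first input: the CNN image feature
print(round(float(p_next.sum()), 6))                    # a probability distribution over the next word
```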
204: describing a video frame sequence of a video to be described by using the RNN model obtained by training to obtain a description sequence, wherein the predicting step comprises the following steps:
(1) The test set is extracted, where Nt is the number of videos in the test set and video_t denotes a video under test; 10 frames of images are extracted from each video, which can be expressed as Image_t = {Image_t^1, …, Image_t^10}.
(2) Using the trained model parameters Θ = {Wi, Bi, θj}, i = 1, …, 5, j = 1, 2, 3, and the CNN model, the CNN feature of each image in Image_t is extracted, obtaining the image features I_t = {I_t^1, …, I_t^10}.
(3) The image features I_t are taken as input and, using the model parameters W obtained by training together with formula (12), the sentence description S = {S_1, …, S_n} is obtained, thereby obtaining the sentence descriptions corresponding to the video.
205: and sorting the reasonableness of the description sequence by using a LexRank method, and selecting the most reasonable description as a final description of the video.
(1) The RNN model is tested on the video feature sequence I_t = {I_t^1, …, I_t^10} to generate the corresponding sentence set S = {S_1, …, S_i, …, S_n}.
(2) Generating sentence features: each sentence S_i in the sentence set S is scanned in turn, with one entry for each distinct word, forming a vocabulary represented as the word list VOL = {w_1, …, w_Nw}, where Nw is the total number of words in the vocabulary VOL. For each word w_i in VOL, each sentence S_j in the set S is scanned in turn, counting the number of occurrences n_ij of word w_i in sentence S_j, where j = 1, …, Ns and Ns is the total number of sentences, and counting the number num(w_i) of sentences in S that contain word w_i. The word frequency tf(w_i, s_j) of each word w_i in each sentence S_j is calculated according to equation (20):
tf(w_i, s_j) = n_ij / Σ_k n_kj   (20)
wherein n_kj is the number of occurrences of the k-th word in the j-th sentence.
For each word w_i in the vocabulary VOL, the inverse document frequency idf(w_i) is calculated according to formula (21):
idf(wi)=log(Nd/num(wi)) (21)
wherein Nd is here the total number of sentences (documents) and num(w_i) is the number of sentences containing word w_i.
According to the vector space model, each sentence S_j in the set S is expressed as an Nw-dimensional vector, whose i-th dimension corresponds to word w_i in the vocabulary and takes the value tfidf(w_i), calculated as:
tfidf(wi)=tf(wi,sj)×idf(wi) (22)
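A short sketch of equations (20) to (22) on a toy sentence set follows; the sentences are placeholders standing in for RNN-generated captions, and the idf denominator counts the sentences that contain the word, as in standard LexRank.

```python
import math
from collections import Counter

S = ["a man is riding a motorcycle",
     "a person rides a bike",
     "a man is riding a bike on a road"]            # toy caption set
tok = [s.split() for s in S]
vocab = sorted({w for sent in tok for w in sent})   # word list VOL
Ns = len(S)

def tf(w, sent):                                    # eq. (20): n_ij / sum_k n_kj
    counts = Counter(sent)
    return counts[w] / sum(counts.values())

def idf(w):                                         # eq. (21), with sentences as documents
    return math.log(Ns / sum(1 for sent in tok if w in sent))

# eq. (22): each sentence as an Nw-dimensional tf-idf vector
vectors = [[tf(w, sent) * idf(w) for w in vocab] for sent in tok]
print(len(vocab), [round(x, 3) for x in vectors[0][:5]])
```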
(3) The cosine value between the two sentence vectors S_i and S_j is used as the sentence similarity, calculated as:
sim(S_i, S_j) = [ Σ_(w ∈ S_i, S_j) tf(w, S_i) · tf(w, S_j) · (idf_w)^2 ] / [ sqrt( Σ_(s_m ∈ S_i) (tf(s_m, S_i) · idf_(s_m))^2 ) · sqrt( Σ_(s_n ∈ S_j) (tf(s_n, S_j) · idf_(s_n))^2 ) ]   (23)
wherein tf(w, S_i) is the word frequency of word w in sentence S_i; tf(w, S_j) is the word frequency of word w in sentence S_j; idf_w is the inverse document frequency of word w; s_m is any word in sentence S_i; tf(s_m, S_i) is the word frequency of word s_m in S_i; idf_(s_m) is the inverse document frequency of word s_m; s_n is any word in sentence S_j; tf(s_n, S_j) is the word frequency of word s_n in S_j; idf_(s_n) is the inverse document frequency of word s_n.
A fully connected undirected graph is then formed, as in FIG. 4(a), with each node u_i corresponding to a sentence S_i and the edges between nodes weighted by the sentence similarity.
(4) A threshold value Degree is set, and all edges with similarity less than Degree are deleted, as shown in fig. 4 (b).
(5) The LexRank score LR of each sentence node u_i is calculated. The initial score of each sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, usually selected in [0.1, 0.2]. The score LR is calculated according to equation (24):
LR(u) = d/N + (1 − d) · Σ_(v ∈ adj(u)) LR(v)/deg(v)   (24)
wherein adj(u) is the set of nodes adjacent to node u; deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v.
(6) And calculating the LR score of each sentence node, sequencing, and selecting the sentence with the highest score as the final description of the video.
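Steps (3) to (6) can be sketched end to end as below; the captions, the threshold value, the damping factor, and the use of raw word counts as tf inside the cosine of equation (23) are illustrative assumptions only.

```python
import math
from collections import Counter

S = ["a man is riding a motorcycle",
     "a person rides a motorcycle",
     "a man is riding a bike on a road",
     "a dog is barking"]                            # toy RNN-generated captions
tok = [s.split() for s in S]
N = len(S)
idf = {w: math.log(N / sum(1 for t in tok if w in t))
       for w in {w for t in tok for w in t}}

def idf_cosine(a, b):                               # idf-modified cosine, eq. (23)
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[w] * cb[w] * idf[w] ** 2 for w in set(a) & set(b))
    da = math.sqrt(sum((ca[w] * idf[w]) ** 2 for w in ca))
    db = math.sqrt(sum((cb[w] * idf[w]) ** 2 for w in cb))
    return num / (da * db) if da and db else 0.0

Degree = 0.1                                        # similarity threshold, step (4)
adj = {u: [v for v in range(N) if v != u and idf_cosine(tok[u], tok[v]) >= Degree]
       for u in range(N)}

d = 0.15                                            # damping factor chosen in [0.1, 0.2]
LR = [d / N] * N                                    # initial score of each node, step (5)
for _ in range(50):                                 # iterate eq. (24) to convergence
    LR = [d / N + (1 - d) * sum(LR[v] / max(len(adj[v]), 1) for v in adj[u])
          for u in range(N)]

print(S[max(range(N), key=lambda u: LR[u])])        # highest-scoring sentence, step (6)
```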
In summary, the embodiment of the present invention, through steps 201 to 205, describes an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 3
Two videos are selected as videos to be described, and as shown in fig. 5, the videos are predicted by using the method based on deep learning and text summarization in the present invention to output corresponding video descriptions:
(1) Using ImageNet as the training set, each picture in the data set is sampled to a 256 × 256 picture and taken as input, with Nm the number of pictures.
(2) Building the first convolution layer: the size of convolution kernel cov1 is set to 11 and the stride to 4; ReLU is selected as max(0, x); a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2; and the convolved data are normalized with local response normalization, where, as in AlexNet, k = 2, n = 5, α = 10^-4, β = 0.75.
(3) Building the second convolution layer: the size of convolution kernel cov2 is set to 5 and the stride to 1; ReLU is selected as max(0, x); a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2; and the convolved data are normalized with local response normalization.
(4) And building a third layer of convolution layer, setting the size of a convolution kernel cov3 to be 3, the step length stride to be 1, and selecting ReLU to be max (0, x).
(5) And building a fourth layer of convolution layer, setting the size of a convolution kernel cov4 to be 3, the step length stride to be 1, and selecting ReLU to be max (0, x).
(6) Building the fifth convolution layer: the size of convolution kernel cov5 is set to 3 and the stride to 1; ReLU is selected as max(0, x); and a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2.
(7) And building a sixth full connection layer, setting the layer as fc6, selecting ReLU as max (0, x), and performing dropout on the processed data.
(8) And building a seventh full connection layer, setting the layer as fc7, selecting ReLU as max (0, x), and performing dropout on the processed data.
(9) And building an eighth fully-connected layer, setting the layer to fc8, and adding a Softmax classifier as an objective function.
(10) And establishing a Convolutional Neural Network (CNN) model by setting the eight-layer network layers.
(11) And training CNN model parameters.
(12) Data processing: each video in the data set is uniformly sampled to 10 frames, each resized to 256 × 256, and input into the trained CNN model to obtain the image features; each frame image is randomly paired with 5 sentences of the video's text descriptions to serve as image-text pairs (a sketch of this frame sampling is given after step (13)).
(13) A Recurrent Neural Network (RNN) model is constructed.
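A sketch of the frame sampling in step (12) is given below, assuming OpenCV (cv2) as the video reader; the file name is a placeholder.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=10, size=(256, 256)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))      # jump to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))      # 256 x 256, as in step (12)
    cap.release()
    return frames

frames = sample_frames("video_0001.avi")                # placeholder path
print(len(frames))
```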
Fig. 5 shows the video text description results generated by the invention. The images in the figure are video frames extracted from a video, and the sentence beside each frame image is the result obtained after the video feature passes through the language model. The sentence at the bottom of the figure is the description obtained after the frame-level sentences are summarized, and serves as the text description of the video.
In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through the convolutional neural network and the recurrent neural network, and selects high-quality, representative sentences from the many sentences through a text summarization method. With this method the user can obtain a description of the video with high accuracy, and the method can be extended to video retrieval.
Reference to the literature
[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[C]. Advances in Neural Information Processing Systems, 2012.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A video description method based on deep learning and text summarization is characterized by comprising the following steps:
1) downloading videos from the Internet, describing each video to form a pair of < videos, description > and form a text description training set;
2) training a convolutional neural network model according to an image classification task through an existing image data set;
3) extracting a video frame sequence from the video, extracting convolutional neural network features with a convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model;
extracting the convolutional neural network characteristics of the image and sentence descriptions corresponding to the image for modeling by using the parameters after training the convolutional neural network model to obtain a target function;
constructing a recurrent neural network; modeling the nonlinear function with a long short-term memory (LSTM) network;
optimizing the objective function by gradient descent, and obtaining the trained long short-term memory network parameters;
4) describing a video frame sequence of a video to be described through a trained recurrent neural network model to obtain a description sequence;
extracting the convolutional neural network characteristics of each image by using the trained model parameters and the convolutional neural network model to obtain image characteristics;
the image characteristics are used as input, sentence description is obtained by using model parameters obtained by training, and accordingly sentence description corresponding to the video is obtained;
5) sequencing the description sequence by using a method based on the vocabulary centrality of the graph as the significance of the text summary, and outputting the final description result of the video;
the description sequence is sequenced, and the final description result of the output video specifically includes:
testing with the RNN model on the video feature sequence I_t = {I_t^1, …, I_t^10} to generate the corresponding sentence set;
generating sentence characteristics, and sequentially scanning each sentence S in all sentence setsiOne for each different word in all the words in (1), and forming a vocabulary table represented by a word list; using two vectors Si,SjThe cosine value between them is used as sentence similarity; setting a threshold value Degree, and deleting all edges with similarity smaller than Degree;
calculating the LexRank score LR of each sentence node u_i, the initial score of each sentence node being d/N, where N is the number of sentence nodes and d is the damping factor, usually selected in [0.1, 0.2]; the score LR is calculated according to the following formula:
LR(u) = d/N + (1 − d) · Σ_(v ∈ adj(u)) LR(v)/deg(v)
wherein adj(u) is the set of nodes adjacent to node u; deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v.
2. The video description method based on deep learning and text summarization as claimed in claim 1, wherein the downloading of videos from the internet and the description of each video form < video, description > pairs, and the forming of the text description training set specifically comprises:
and forming a < video, description > pair by the existing video set and the sentence description corresponding to each video to form a text description training set.
CN201510697454.3A 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text Active CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Publications (2)

Publication Number Publication Date
CN105279495A CN105279495A (en) 2016-01-27
CN105279495B true CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A Active CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
US9984772B2 (en) * 2016-04-07 2018-05-29 Siemens Healthcare Gmbh Image analytics question answering
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN107391505B (en) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 Image processing method and system
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition methods and device based on two-way LSTM neural network
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 A kind of determination method and device of video and the Video Key word degree of correlation
CN107707931B (en) * 2016-08-08 2021-09-10 阿里巴巴集团控股有限公司 Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN106372107B (en) * 2016-08-19 2020-01-17 中兴通讯股份有限公司 Method and device for generating natural language sentence library
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 A kind of generation method from structured text to iamge description
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Egg embryo classification based on deep learning
GB2555431A (en) * 2016-10-27 2018-05-02 Nokia Technologies Oy A method for analysing media content
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on depth LSTM network
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method of multi-cascade junction cyclic neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106845411B (en) * 2017-01-19 2020-06-30 清华大学 Video description generation method based on deep learning and probability map model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Method and device for generating graphic description
DE102017205713A1 (en) 2017-04-04 2018-10-04 Siemens Aktiengesellschaft Method and control device for controlling a technical system
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction technique and device, storage medium
US10872273B2 (en) * 2017-05-02 2020-12-22 Kodak Alaris Inc. System and method for batch-normalized recurrent highway networks
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 A kind of method and system for realizing image switch labels
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
CN107291882B (en) * 2017-06-19 2020-07-14 江苏赛睿信息科技股份有限公司 Automatic statistical analysis method for data
CN107515900B (en) * 2017-07-24 2020-10-30 宗晖(上海)机器人有限公司 Intelligent robot and event memo system and method thereof
CN107368887B (en) * 2017-07-25 2020-08-07 江西理工大学 Deep memory convolutional neural network device and construction method thereof
CN111133453B (en) 2017-08-04 2024-05-14 诺基亚技术有限公司 Artificial neural network
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 A kind of picture based on attribute probability vector guiding attention mode describes method
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN109522531B (en) * 2017-09-18 2023-04-07 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Method for classifying hyperspectral remote sensing images of guide filtering long and short memory neural network
CN107818306B (en) * 2017-10-31 2020-08-07 天津大学 Video question-answering method based on attention model
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 Dynamic multi-modal video description generation method
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
CN108307229B (en) * 2018-02-02 2023-12-22 新华智云科技有限公司 Video and audio data processing method and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN108765383B (en) * 2018-03-22 2022-03-18 山西大学 Video description method based on deep migration learning
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108683924B (en) * 2018-05-30 2021-12-28 北京奇艺世纪科技有限公司 Video processing method and device
CN108881950B (en) * 2018-05-30 2021-05-25 北京奇艺世纪科技有限公司 Video processing method and device
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109711022B (en) * 2018-12-17 2022-11-18 哈尔滨工程大学 Submarine anti-sinking system based on deep learning
CN109960747B (en) 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN110096707B (en) * 2019-04-29 2020-09-29 北京三快在线科技有限公司 Method, device and equipment for generating natural language and readable storage medium
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 Self-adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description
CN110659392B (en) * 2019-09-29 2022-05-06 北京市商汤科技开发有限公司 Retrieval method and device, and storage medium
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN110765921B (en) * 2019-10-18 2022-04-19 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on LSTM model from coarse to fine
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 Video annotation method based on deep learning
CN111404676B (en) * 2020-03-02 2023-08-29 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret key and ciphertext
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph rolling network
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111931690B (en) * 2020-08-28 2024-08-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN113191262B (en) * 2021-04-29 2022-08-19 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization; Gunes Erkan; Journal of Artificial Intelligence Research; 2004-12-04; Vol. 22, No. 1; pp. 457-467
Translating Videos to Natural Language Using Deep Recurrent Neural Networks; Subhashini Venugopalan et al.; Computer Science; 2014-12-19; pp. 3-6

Also Published As

Publication number Publication date
CN105279495A (en) 2016-01-27

Similar Documents

Publication Publication Date Title
CN105279495B (en) A kind of video presentation method summarized based on deep learning and text
CN106503055B (en) A kind of generation method from structured text to iamge description
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
US20190102655A1 (en) Training data acquisition method and device, server and storage medium
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN113641820A (en) Visual angle level text emotion classification method and system based on graph convolution neural network
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN109918539B (en) Audio and video mutual retrieval method based on user click behavior
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
WO2023035923A1 (en) Video checking method and apparatus and electronic device
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN111368075A (en) Article quality prediction method and device, electronic equipment and storage medium
CN110377778A (en) Figure sort method, device and electronic equipment based on title figure correlation
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110717090A (en) Network public praise evaluation method and system for scenic spots and electronic equipment
CN111507089A (en) Document classification method and device based on deep learning model and computer equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115775349A (en) False news detection method and device based on multi-mode fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220322

Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong

Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Patentee before: Tianjin University