CN105279495B - A video description method based on deep learning and text summarization - Google Patents

A video description method based on deep learning and text summarization

Info

Publication number
CN105279495B
CN105279495B (application CN201510697454.3A)
Authority
CN
China
Prior art keywords
video
sentence
neural networks
description
sequence
Prior art date
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495A (en)
Inventor
李广
马书博
韩亚洪
Original Assignee
天津大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天津大学 filed Critical 天津大学
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en

Abstract

The invention discloses a video description method based on deep learning and text summarization, comprising: training a convolutional neural network model on an existing image dataset according to an image classification task; extracting video frame sequences from videos and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input for training a recurrent neural network model; describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence; ranking the description sequence by the method of graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video. The method describes, in natural language, the events taking place in a video and the object attributes related to those events, so as to describe and summarize the video content.

Description

A video description method based on deep learning and text summarization

Technical field

The present invention relates to the field of video description, and in particular to a video description method based on deep learning and text summarization.

Background technique

Describing a video in natural language is extremely important, both for understanding the video and for retrieving it on the Web. The linguistic description of video is also a major research topic in multimedia and computer vision. Video description means that, given a video, its content is observed (i.e., video features are acquired) and corresponding sentences are generated from that content. When a person watches a video, especially one of an action category, he or she understands it to some degree after viewing and can tell in language what happened in it, for example describing the video with the sentence "a person is riding a motorbike". However, faced with a large number of videos, describing them one by one manually requires a great deal of time, manpower, and money. It is therefore necessary to analyze video features with computer techniques, combine them with natural language processing methods, and generate descriptions of the videos. On the one hand, through video description people can understand a video more accurately from the semantic point of view. On the other hand, in the field of video retrieval, having a user retrieve the corresponding video by entering a passage of description is very difficult and presents a real challenge.

Various video description methods have emerged in recent years. For example, by analyzing video features, the objects present in a video and the action relations between them can be identified; a fixed language template, subject + verb + object, is then used: the subject and object are chosen from the recognized objects, the action relation between them serves as the predicate, and in this way a sentence describing the video is generated.

However, such methods have limitations. For example, generating sentences with language templates easily leads to fixed, overly uniform sentence patterns that lack the expressive color of natural human language. Meanwhile, recognizing the objects, actions, and so on in the video requires different features, which makes the procedure cumbersome and requires a large amount of time to train on video features. Moreover, the recognition accuracy directly affects the quality of the generated sentences, and such a multi-stage method requires every stage to achieve high correctness, which is difficult to realize.

Summary of the invention

The present invention provides a video description method based on deep learning and text summarization. The invention describes, in natural language, the events taking place in a video and the object attributes related to those events, so as to describe and summarize the video content, as detailed below:

A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:

downloading videos from the Internet and describing each video, forming <video, description> pairs and constituting a text-description training set;

training a convolutional neural network model on an existing image dataset according to an image classification task;

extracting video frame sequences from the videos and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;

describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence;

ranking the description sequence by the method of graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video.

The step of downloading videos from the Internet, describing each video, and forming <video, description> pairs to constitute the text-description training set is specifically:

forming <video, description> pairs from an existing video collection and the sentence description corresponding to each video, constituting the text-description training set.

The step of extracting video frame sequences from the videos, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model is specifically:

modeling the convolutional neural network features of the images, extracted with the parameters of the trained convolutional neural network model, together with the sentence descriptions corresponding to the images, to obtain an objective function;

constructing the recurrent neural network, with the nonlinear function modeled by a long short-term memory network;

optimizing the objective function with the gradient descent method to obtain the trained long short-term memory network parameters.

The step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence is specifically:

extracting the convolutional neural network feature of each image with the trained model parameters and the convolutional neural network model, obtaining image features;

taking the image features as input and obtaining sentence descriptions with the model parameters obtained by training, thereby obtaining the sentence descriptions corresponding to the video.

The beneficial effect of the technical scheme provided by the present invention is: each video consists of a frame sequence, and a convolutional neural network is used to extract the low-level features of each video frame; this effectively avoids introducing excessive noise when video features are extracted with traditional deep-learning pipelines, which would reduce the accuracy of the sentences generated later. The trained recurrent neural network converts each frame picture into a sentence, generating a set of sentences. The automatic text summarization method then computes the centrality between sentences and screens a high-quality, representative sentence from the set as the description of the video. This yields better video description accuracy and sentence diversity. Meanwhile, the method based on deep learning and text summarization can be effectively generalized to video retrieval applications, although the method is limited to English descriptions of the video content.

Detailed description of the invention

Fig. 1 is a flowchart of the video description method based on deep learning and text summarization;

Fig. 2 is a schematic diagram of the convolutional neural network (CNN) model used in the present invention;

Wherein, Cov denotes a convolution kernel; ReLU denotes the function max(0, x); Pool denotes the pooling operation; LRN is local response normalization; Softmax is the objective function.

Fig. 3 is a schematic diagram of the recurrent neural network used in the present invention;

Wherein, x_t denotes the input at state t; h_{t-1} denotes the hidden state of the previous state; i is the input gate; f is the forget gate; o is the output gate; c is the memory cell; m_t is the output after one LSTM unit.

Fig. 4(a) is the initial fully connected LexRank graph;

Wherein, S = {S_1, ..., S_10} are 10 sentences generated by the recurrent neural network (RNN); in the graph model these 10 sentences are represented as 10 nodes; the similarity between nodes is represented by an edge, forming a fully connected graph, and the thickness of an edge indicates the magnitude of the similarity.

Fig. 4(b) is the connection graph after LexRank pruning;

Wherein, by setting a threshold, the edges between nodes with small similarity are removed; the remaining edges connect nodes (sentences) whose similarity is high.

Fig. 5 is a schematic diagram of sentences generated for some video frames after description;

Wherein, the sentence under each frame image is generated with the CNN-RNN model used in the present invention, and the sentence indicated by the arrow is the text description of the video obtained after the LexRank method summarizes the text descriptions of the video.

Specific embodiment

To make the object, technical solution, and advantages of the present invention clearer, embodiments of the present invention are described below in further detail.

Given the problems of the background art, and given that describing images with deep-learning methods has achieved a clear improvement in the image domain, people have been inspired to apply deep-learning methods to video, improving the diversity and correctness of the generated video descriptions.

To this end, the embodiment of the present invention proposes a video description method based on deep learning and text summarization. First, the method extracts the visual feature of each video frame with a convolutional neural network framework. Then the frame features of each video are fed into a recurrent neural network framework, so that a description is generated for each visual feature, i.e., for each frame of the video. In this way a set of sentences is obtained. To obtain the most expressive, high-quality sentence as the description of the video, the method uses text summarization: all sentences are ranked by computing the similarity between them, so that wrong sentences and low-quality sentences are avoided as the final description of the video. The automatic text summarization method not only yields a representative sentence but also provides a degree of correctness and reliability, thereby improving the accuracy of the video description. Meanwhile, the method also overcomes some of the technical difficulties faced by video retrieval.

Embodiment 1

A video description method based on deep learning and text summarization; referring to Fig. 1, the method comprises the following steps:

101: downloading videos from the Internet and describing each video (in English), forming <video, description> pairs and constituting a text-description training set, wherein each video corresponds to multiple descriptions, which together constitute a text description sequence;

102: training a convolutional neural network (CNN) model on an existing image dataset according to an image classification task;

For example: ImageNet.

103: extracting video frame sequences from the videos and extracting CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and training to obtain the recurrent neural network (RNN) model;

104: describing the video frame sequence of the video to be described with the trained RNN model, obtaining a description sequence;

105: ranking the description sequence by reasonableness with the method of graph-based lexical centrality as salience in text summarization (LexRank), and selecting the most reasonable description as the final description of the video.

In conclusion, through steps 101 to 105 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, so as to describe and summarize the video content.
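For orientation, a compact sketch of how steps 103 to 105 compose for a single video is given below in Python; the helper names (extract_frames, cnn_feature, generate_sentence, lexrank_best_sentence) are hypothetical placeholders for the components detailed in Embodiment 2, not functions defined by the patent.

```python
def describe_video(video_path, cnn_model, lstm_params, vocab, embed):
    """Sketch of steps 103-105 for one video, using hypothetical helper functions."""
    frames = extract_frames(video_path, num_frames=10)               # frame sequence
    features = [cnn_feature(cnn_model, frame) for frame in frames]   # CNN feature per frame (step 103)
    sentences = [generate_sentence(f, lstm_params, vocab, embed)     # one sentence per frame (step 104)
                 for f in features]
    return lexrank_best_sentence(sentences)                          # LexRank ranking (step 105)
```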

Embodiment 2

201: downloading videos from the Internet and describing each video, forming <video, description> pairs and constituting a text-description training set;

The step specifically includes:

(1) The Microsoft Research Video Description Corpus is downloaded from the Internet. This dataset contains 1970 video clips collected from YouTube, and the dataset can be expressed as VID = {vid_1, ..., vid_{N_d}}, where N_d is the total number of videos in the set VID.

(2) Each video can have multiple corresponding descriptions; the sentence descriptions of a video are Sentences = {Sentence_1, ..., Sentence_N}, where N is the number of descriptions (Sentence_1, ..., Sentence_N) corresponding to the video.

(3) <video, description> pairs are formed from the existing video collection VID and the sentence descriptions Sentences corresponding to each video, constituting the text-description training set.
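As a concrete illustration of step 201, a minimal sketch of how the <video, description> pairs might be assembled is given below; the corpus dictionary and its contents are illustrative assumptions, not data from the patent.

```python
# Illustrative sketch of the <video, description> training pairs of step 201,
# assuming the corpus is available as a mapping from video id to its sentences.
corpus = {
    "vid_1": ["a man is riding a motorbike", "a person rides a motorcycle"],
    "vid_2": ["a dog is running on the grass"],
    # ... up to vid_Nd from the Microsoft Research Video Description Corpus
}
# Each video is paired with every one of its sentence descriptions.
training_set = [(vid, sentence)
                for vid, sentences in corpus.items()
                for sentence in sentences]
```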

202: training a convolutional neural network (CNN) model on an existing image dataset according to an image classification task, and training the CNN model parameters;

The step specifically includes:

(1) The AlexNet [1] CNN model shown in Fig. 2 is constructed: the model contains 8 network layers, of which the first 5 layers are convolutional layers and the last 3 layers are fully connected layers.

(2) ImageNet is used as the training set. Each picture in the image dataset is sampled to a 256*256 picture, and IMAGE = {Image_1, ..., Image_{N_m}} is taken as the input, where N_m is the number of pictures. According to the network layers set in Fig. 2, the 1st layer can be expressed as:

F_1(IMAGE) = norm{pool[max(0, W_1*IMAGE + B_1)]}   (1)

Wherein, IMAGE denotes the input picture; W_1 denotes the convolution kernel parameters; B_1 denotes the bias; F_1(IMAGE) denotes the output after the first network layer; norm denotes the normalization operation. In this network layer, the image after convolution is processed with the rectified linear function (max(0, x), with x = W_1*IMAGE + B_1), the pooling operation is then applied to the feature maps, and local response normalization (LRN) is carried out on the result; the normalization is as follows:

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i-n/2)}^{min(M-1, i+n/2)} (a^j_{x,y})^2 )^β   (2)

Wherein, M is the number of feature maps after pooling; i indexes the i-th of the M feature maps; n is the size of the local normalization, i.e., normalization is carried out over every n feature maps; a^i_{x,y} denotes the value at coordinate (x, y) in the i-th feature map; k is a bias; α and β are normalization parameters; b^i_{x,y} is the output result after local response normalization (LRN).

In AlexNet, k = 2, n = 5, α = 10^-4, β = 0.75.
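Before moving to the second layer, a minimal NumPy sketch of the LRN operation of formula (2) is given below, assuming the M feature maps are stored as an array of shape (M, H, W); the function name and array layout are illustrative, not specified by the patent.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across feature maps, per formula (2).

    a: array of shape (M, H, W) holding the M feature maps after pooling.
    Returns b of the same shape.
    """
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        # Sum of squares over the n neighbouring feature maps centred on map i.
        lo, hi = max(0, i - n // 2), min(M - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Example: normalize 96 feature maps of size 27x27 with the AlexNet parameters.
maps = np.random.rand(96, 27, 27).astype(np.float32)
normed = local_response_norm(maps)
```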

Continuing with the model, F_1(IMAGE) is taken as the input of the second network layer, which, according to the second layer settings, can be expressed as:

F_2(IMAGE) = max(0, W_2*F_1(IMAGE) + B_2)   (3)

Wherein, W_2 denotes the convolution kernel parameters; B_2 denotes the bias; F_2(IMAGE) denotes the output result after the second network layer. The first layer and the second layer have the same setting, and only the sizes of the convolution kernel and of the pooling kernel change.

According to the network settings of AlexNet, the remaining convolutional layers can be expressed in turn as:

F_3(IMAGE) = max(0, W_3*F_2(IMAGE) + B_3)   (4)

F_4(IMAGE) = max(0, W_4*F_3(IMAGE) + B_4)   (5)

F_5(IMAGE) = pool[max(0, W_5*F_4(IMAGE) + B_5)]   (6)

Wherein, W_3, W_4, W_5 and B_3, B_4, B_5 are the convolution parameters and biases of the respective layers.

The last 3 layers are fully connected layers and, according to the network layer settings of Fig. 2, can be expressed in turn as:

F_6(IMAGE) = fc[F_5(IMAGE), θ_1]   (7)

F_7(IMAGE) = fc[F_6(IMAGE), θ_2]   (8)

F_8(IMAGE) = fc[F_7(IMAGE), θ_3]   (9)

Wherein, fc denotes a fully connected layer; θ_1, θ_2, θ_3 denote the parameters of the three fully connected layers; and the feature of the last layer, F_8(IMAGE), is input into a 1000-class multiclass classifier for classification.

(3) According to the current network, the multiclass classifier is set up; the formula can be expressed as:

l(Θ) = - Σ_t Σ_{k=1}^{m} 1{y^(t) = k} · log( exp(θ_k^T x^(t)) / Σ_{j=1}^{m} exp(θ_j^T x^(t)) )   (10)

Wherein, l(Θ) is the objective function; m is the number of image classes in ImageNet; x^(t) is the CNN feature extracted by the AlexNet network for each training image; y^(t) is the label corresponding to each image; Θ = {W_p, B_p, θ_q}, p = 1, ..., 5, q = 1, 2, 3, are the parameters of each network layer. The parameters of the objective function are optimized with the gradient descent method, thereby obtaining the parameters Θ of the AlexNet network.
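A minimal NumPy sketch of the softmax objective of formula (10) and one gradient-descent update on the classifier weights follows; the function name, batch size, and toy data are illustrative assumptions, and in practice the gradient is propagated through all layers of Θ rather than only the classifier.

```python
import numpy as np

def softmax_loss_and_grad(theta, X, y):
    """Objective l(Theta) of formula (10) and its gradient w.r.t. the classifier weights.

    theta: (m, D) classifier weights, one row per class.
    X:     (N, D) CNN features, one row per training image.
    y:     (N,)   integer class labels in [0, m).
    """
    logits = X @ theta.T                                   # (N, m)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    N = X.shape[0]
    loss = -np.log(probs[np.arange(N), y]).mean()
    probs[np.arange(N), y] -= 1.0                          # gradient of the negative log-likelihood
    grad = probs.T @ X / N                                  # (m, D)
    return loss, grad

# One gradient-descent step on toy data (m = 1000 classes, D = 4096-dimensional features).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4096)).astype(np.float32)
y = rng.integers(0, 1000, size=32)
theta = np.zeros((1000, 4096), dtype=np.float32)
loss, grad = softmax_loss_and_grad(theta, X, y)
theta -= 0.01 * grad
```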

203: extracting video frame sequences from the videos and extracting CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network (RNN) model, and training to obtain the recurrent neural network (RNN) model;

The step specifically:

(1) Using the parameters of the CNN model trained in step 202, the CNN feature I of an image is extracted and modeled together with the sentence description S corresponding to the image; the objective function is:

θ* = argmax_θ Σ_{(S,I)} log p(S | I; θ)   (11)

Wherein, (S, I) represents an image-text pair in the training data; θ is the model parameter to be optimized; θ* is the parameter after optimization.

The purpose of training is to maximize the sum of the log probabilities of the sentences generated over all samples, given the observation of the input image I. The probability p(S | I; θ) is computed with the conditional probability chain rule; the expression is:

log p(S | I; θ) = Σ_{t=0}^{N} log p(S_t | I, S_0, S_1, ..., S_{t-1})   (12)

Wherein, S_0, S_1, ..., S_{t-1}, S_t denote the words in the sentence. The unknown quantity p(S_t | I, S_0, S_1, ..., S_{t-1}) in the formula is modeled with a recurrent neural network.

(2) The recurrent neural network (RNN) is constructed:

Conditioned on the first t-1 words, these words are expressed as a fixed-length hidden state h_t; when a new input x_t arrives, the hidden state is updated by a nonlinear function f; the expression is:

h_{t+1} = f(h_t, x_t)   (13)

Wherein, h_{t+1} denotes the next hidden state.

(3) The nonlinear function f is modeled by constructing the long short-term memory network (LSTM) shown in Fig. 3;

Wherein, i_t is the input gate, f_t is the forget gate, o_t is the output gate, and c is the memory cell; the update and output of each state can be expressed as:

i_t = σ(W_ix·x_t + W_im·m_{t-1})   (14)

f_t = σ(W_fx·x_t + W_fm·m_{t-1})   (15)

o_t = σ(W_ox·x_t + W_om·m_{t-1})   (16)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx·x_t + W_cm·m_{t-1})   (17)

m_t = o_t ⊙ c_t   (18)

p_{t+1} = Softmax(m_t)   (19)

Wherein, ⊙ denotes the element-wise product between gate values; the matrix set W = {W_ix, W_im, W_fx, W_fm, W_ox, W_om, W_cx, W_cm} contains the parameters to be trained; σ(·) is the sigmoid function (e.g., σ(W_ix·x_t + W_im·m_{t-1}) and σ(W_fx·x_t + W_fm·m_{t-1}) are sigmoid functions); h(·) is the hyperbolic tangent function (e.g., h(W_cx·x_t + W_cm·m_{t-1}) is a hyperbolic tangent function). p_{t+1} is the probability distribution of the next word after the Softmax classifier; m_t is the current state feature.

(4) The objective function (11) is optimized with the gradient descent method, obtaining the trained long short-term memory network (LSTM) parameters W.
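A minimal NumPy sketch of one LSTM update following formulas (14)-(19) is given below; the dictionary layout of the parameters W, the output projection W['out'] to the vocabulary, and the toy dimensions are illustrative assumptions rather than details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM update following formulas (14)-(19).

    x_t:    (D,)  current input (image feature or word embedding).
    m_prev: (H,)  previous output m_{t-1}.
    c_prev: (H,)  previous cell state.
    W:      dict with keys 'ix','im','fx','fm','ox','om','cx','cm','out'.
    Returns (m_t, c_t, p_next), where p_next is the softmax word distribution.
    """
    i_t = sigmoid(W['ix'] @ x_t + W['im'] @ m_prev)                        # (14) input gate
    f_t = sigmoid(W['fx'] @ x_t + W['fm'] @ m_prev)                        # (15) forget gate
    o_t = sigmoid(W['ox'] @ x_t + W['om'] @ m_prev)                        # (16) output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['cx'] @ x_t + W['cm'] @ m_prev)   # (17) cell update
    m_t = o_t * c_t                                                        # (18) state output
    logits = W['out'] @ m_t                         # projection to the vocabulary (assumption)
    p_next = np.exp(logits - logits.max())
    p_next /= p_next.sum()                                                 # (19) Softmax(m_t)
    return m_t, c_t, p_next

# Toy dimensions: input 4096 (CNN feature), hidden 512, vocabulary 3000.
rng = np.random.default_rng(0)
D, H, V = 4096, 512, 3000
W = {k: rng.normal(scale=0.01, size=(H, D if k.endswith('x') else H))
     for k in ['ix', 'im', 'fx', 'fm', 'ox', 'om', 'cx', 'cm']}
W['out'] = rng.normal(scale=0.01, size=(V, H))
m, c = np.zeros(H), np.zeros(H)
image_feature = rng.normal(size=D)
m, c, p = lstm_step(image_feature, m, c, W)   # first step conditions on the image
```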

204: describing the video frame sequence of the video to be described with the trained RNN model to obtain the description sequence; the prediction steps are as follows:

(1) The test set VID_test = {vid_1, ..., vid_{N_t}} is extracted, where N_t is the number of test videos and t indexes a test video; 10 frames are extracted from each video, which can be expressed as Image_t = {Image_t^1, ..., Image_t^10}.

(2) With the trained model parameters Θ = {W_i, B_i, θ_j}, i = 1, ..., 5, j = 1, 2, 3, the CNN model is used to extract the CNN feature of each image in Image_t, obtaining the image features I_t = {I_t^1, ..., I_t^10}.

(3) The image features I_t are taken as input and, with the model parameters W obtained by training, formula (12) is evaluated to obtain the sentence descriptions S = {S_1, ..., S_n}, thereby obtaining the sentence descriptions corresponding to the video.
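A minimal greedy-decoding sketch of step 204(3) is given below, reusing the lstm_step function from the previous sketch; the embed function (mapping a word id to an input vector of the same dimension as the image feature), the end-of-sentence id, and the maximum length are illustrative assumptions.

```python
import numpy as np

def generate_sentence(image_feature, W, vocab, embed, end_id=0, max_len=20):
    """Greedy sentence generation for one frame feature I_t^k, per formula (12).

    image_feature: CNN feature of one frame.
    W:             trained LSTM parameters (see lstm_step above).
    vocab:         list mapping word ids to word strings.
    embed:         hypothetical lookup from a word id to its input vector.
    """
    H = W['im'].shape[0]
    m, c = np.zeros(H), np.zeros(H)
    m, c, p = lstm_step(image_feature, m, c, W)   # condition on the image first
    words = []
    for _ in range(max_len):
        word_id = int(np.argmax(p))               # most probable next word S_t
        if word_id == end_id:
            break
        words.append(vocab[word_id])
        m, c, p = lstm_step(embed(word_id), m, c, W)
    return ' '.join(words)

# One sentence per extracted frame gives the description sequence S = {S_1, ..., S_n}:
# sentences = [generate_sentence(f, W, vocab, embed) for f in frame_features]
```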

205: ranking the description sequence by reasonableness with the LexRank method, and selecting the most reasonable description as the final description of the video.

(1) The RNN model is applied to the video feature sequence I_t = {I_t^1, ..., I_t^10} to generate the corresponding sentence set S = {S_1, ..., S_i, ..., S_n}.

(2) Sentence features are generated. All sentences of the sentence set S are scanned in order and all the words in each sentence S_i are collected; each distinct word is kept once, forming the vocabulary VOL = {w_1, ..., w_{N_w}} represented as a word list, where N_w is the total number of words in the vocabulary VOL. For each word w_i in the vocabulary VOL, every sentence S_j in the sentence set S is scanned in order, and the frequency n_ij with which word w_i appears in sentence S_j is counted, where j = 1, ..., N_s and N_s is the total number of sentences; the number num(w_i) of sentences in the set S that contain word w_i is also counted. The term frequency tf(w_i, s_j) of each word w_i in each sentence S_j is computed according to formula (20), where i = 1, ..., N_w and j = 1, ..., N_s:

tf(w_i, s_j) = n_ij / Σ_k n_kj   (20)

Wherein, n_kj is the number of times the k-th word appears in the j-th sentence.

For each word w_i in the vocabulary VOL, the inverse document frequency idf(w_i) is computed according to formula (21):

idf(w_i) = log(N_d / num(w_i))   (21)

Wherein, N_d is the total number of sentences (documents) in the set S.

According to the vector space model, each sentence S_j in the set S is expressed as an N_w-dimensional vector whose i-th dimension corresponds to the word w_i in the vocabulary and whose value is tfidf(w_i), computed as follows:

tfidf(w_i) = tf(w_i, s_j) × idf(w_i)   (22)

(3) The cosine value between two sentence vectors S_i and S_j is used as the sentence similarity; the formula is as follows:

idf-modified-cosine(S_i, S_j) = Σ_{w ∈ S_i, S_j} tf(w, S_i) · tf(w, S_j) · (idf_w)^2 / ( sqrt( Σ_{s_m ∈ S_i} (tf(s_m, S_i) · idf_{s_m})^2 ) × sqrt( Σ_{s_n ∈ S_j} (tf(s_n, S_j) · idf_{s_n})^2 ) )   (23)

Wherein, tf(w, S_i) is the term frequency of word w in sentence S_i; tf(w, S_j) is the term frequency of word w in sentence S_j; idf_w is the inverse document frequency of word w; s_m is any word in sentence S_i; tf(s_m, S_i) is the term frequency of word s_m in S_i; idf_{s_m} is the inverse document frequency of word s_m; s_n is any word in sentence S_j; tf(s_n, S_j) is the term frequency of word s_n in S_j; idf_{s_n} is the inverse document frequency of word s_n.

A fully connected undirected graph is thus formed, as in Fig. 4(a); each node u_i is a sentence S_i, and the edges between nodes carry the sentence similarity.

(4) A threshold Degree is set, and all edges whose similarity is less than Degree are deleted, as in Fig. 4(b).

(5) The LexRank score LR of each sentence node u_i is computed. The initial score of each sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, generally chosen between [0.1, 0.2]; the score LR is computed according to formula (24):

LR(u) = d/N + (1 - d) · Σ_{v ∈ adj(u)} LR(v) / deg(v)   (24)

Wherein, deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v; adj(u) is the set of nodes adjacent to u.

(6) The LR score of each sentence node is computed and sorted, and the sentence with the highest score is selected as the final description of the video.
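A compact, self-contained Python sketch of step 205 (formulas (20) to (24)) is given below; the whitespace tokenization, the threshold value, the damping factor, the number of iterations, and the example sentences are illustrative assumptions.

```python
import math
from collections import Counter

def lexrank_best_sentence(sentences, degree_threshold=0.1, d=0.15, iters=50):
    """Pick the most central sentence per step 205 (formulas (20)-(24)).

    sentences: list of sentence strings generated by the RNN.
    """
    docs = [s.lower().split() for s in sentences]          # simple tokenisation (assumption)
    N = len(docs)
    # Term frequency (20) and inverse document frequency (21).
    counts = [Counter(doc) for doc in docs]
    tf = [{w: c / sum(cnt.values()) for w, c in cnt.items()} for cnt in counts]
    vocab = set(w for doc in docs for w in doc)
    idf = {w: math.log(N / sum(1 for doc in docs if w in doc)) for w in vocab}

    def idf_cosine(i, j):
        # idf-modified cosine similarity (23) between sentence vectors (22).
        num = sum(tf[i][w] * tf[j][w] * idf[w] ** 2 for w in tf[i] if w in tf[j])
        den_i = math.sqrt(sum((tf[i][w] * idf[w]) ** 2 for w in tf[i]))
        den_j = math.sqrt(sum((tf[j][w] * idf[w]) ** 2 for w in tf[j]))
        return num / (den_i * den_j) if den_i and den_j else 0.0

    # Build the graph and prune edges below the threshold Degree (Fig. 4).
    adj = {i: [j for j in range(N) if j != i and idf_cosine(i, j) >= degree_threshold]
           for i in range(N)}
    # Iterate the LexRank score of formula (24), starting from d/N.
    lr = [d / N] * N
    for _ in range(iters):
        lr = [d / N + (1 - d) * sum(lr[v] / max(len(adj[v]), 1) for v in adj[u])
              for u in range(N)]
    return sentences[max(range(N), key=lambda u: lr[u])]

# Example with sentences an RNN might have produced for the frames of one video.
print(lexrank_best_sentence([
    "a man is riding a motorbike",
    "a man is riding a bike",
    "a person rides a motorcycle on the road",
    "a dog is running",
]))
```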

In conclusion, through steps 201 to 205 the embodiment of the present invention describes, in natural language, the events taking place in a video and the object attributes related to those events, so as to describe and summarize the video content.

Embodiment 3

Here two videos are chosen as the videos to be described, as shown in Fig. 5; the method based on deep learning and text summarization of the present invention is used to make predictions on them and to output the corresponding video descriptions:

(1) ImageNet is used as the training set; each picture in the dataset is sampled to a 256*256 picture, and IMAGE = {Image_1, ..., Image_{N_m}} is taken as the input, where N_m is the number of pictures.

(2) The first convolutional layer is built: the convolution kernel cov1 size is set to 11 with stride 4; ReLU is chosen as max(0, x); a pooling operation with kernel size 3 and stride 2 is applied to the feature maps after convolution; and local response normalization is used to normalize the data after convolution. In AlexNet, k = 2, n = 5, α = 10^-4, β = 0.75.

(3) The second convolutional layer is built: the convolution kernel cov2 size is set to 5 with stride 1; ReLU is max(0, x); a pooling operation with kernel size 3 and stride 2 is applied to the feature maps after convolution; and local response normalization is used to normalize the data after convolution.

(4) The third convolutional layer is built: the convolution kernel cov3 size is set to 3 with stride 1; ReLU is max(0, x).

(5) The fourth convolutional layer is built: the convolution kernel cov4 size is set to 3 with stride 1; ReLU is max(0, x).

(6) The fifth convolutional layer is built: the convolution kernel cov5 size is set to 3 with stride 1; ReLU is max(0, x); a pooling operation with kernel size 3 and stride 2 is applied to the feature maps after convolution.

(7) The sixth layer, the fully connected layer fc6, is built; ReLU is max(0, x); dropout is applied to the processed data.

(8) The seventh layer, the fully connected layer fc7, is built; ReLU is max(0, x); dropout is applied to the processed data.

(9) The eighth fully connected layer, fc8, is built, and a Softmax classifier is added as the objective function.

(10) By setting up the above eight network layers, the convolutional neural network (CNN) model is established (a code sketch of this eight-layer stack is given after this list).

(11) The CNN model parameters are trained.

(12) Data processing: 10 frames are uniformly extracted from each video in the dataset and sampled to 256*256 size. The images are input into the trained CNN model to obtain image features, and each frame image is randomly paired with 5 of the video's text descriptions to form image-text pairs.

(13) recurrent neural network (RNN) model is constructed.
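The eight-layer network of steps (2) to (10) can be written compactly in a modern framework; the sketch below uses PyTorch as an illustrative assumption (the patent does not name an implementation framework), and the channel counts follow the original AlexNet since the patent only fixes the kernel sizes, strides, LRN parameters, ReLU, pooling, and dropout.

```python
import torch.nn as nn

# Illustrative PyTorch sketch of the eight-layer CNN described in steps (2)-(10).
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),               # cov1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),   # cov2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # cov3
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # cov4
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # cov5
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),               # fc6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),                      # fc7
    nn.Linear(4096, 1000),                                               # fc8
)
# Training would minimise nn.CrossEntropyLoss() (the Softmax objective) with SGD
# on the 256*256 images of step (1), which this layer stack accepts.
```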

Fig. 5 shows the video text description results generated by the present invention. The picture part of the figure shows the video frames extracted from the videos, and the sentence corresponding to each frame image is the result obtained by passing the video feature through the language model. The lower part of the figure shows the description of the video itself obtained, after summarization, from the sentences generated with the video features alone by the model transferred from images.

In conclusion the frame sequence of each video is passed through convolutional neural networks and circulation nerve net by the embodiment of the present invention Network is converted to a series of sentence, and the method summarized by text, and mass height is screened from numerous sentences and has generation The sentence of table.User can be used this method and obtain the description of video, and the accuracy of description is higher, and can promote Into the retrieval of video.

Bibliography

[1] Krizhevsky A, Sutskever I, Hinton G. ImageNet Classification with Deep Convolutional Neural Networks [J]. Advances in Neural Information Processing Systems, 2012.

It will be appreciated by those skilled in the art that the drawings are only schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (2)

1. A video description method based on deep learning and text summarization, characterized in that the video description method comprises the following steps:
1) downloading videos from the Internet and describing each video, forming <video, description> pairs and constituting a text-description training set;
2) training a convolutional neural network model on an existing image dataset according to an image classification task;
3) extracting video frame sequences from the videos and extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
that is, modeling the convolutional neural network features of the images, extracted with the parameters of the trained convolutional neural network model, together with the sentence descriptions corresponding to the images, to obtain an objective function;
constructing the recurrent neural network, with the nonlinear function modeled by a long short-term memory network;
optimizing the objective function with the gradient descent method to obtain the trained long short-term memory network parameters;
4) describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain a description sequence;
that is, extracting the convolutional neural network feature of each image with the trained model parameters and the convolutional neural network model, obtaining image features;
taking the image features as input and obtaining sentence descriptions with the model parameters obtained by training, thereby obtaining the sentence descriptions corresponding to the video;
5) ranking the description sequence by the method of graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video;
wherein ranking the description sequence and outputting the final description result of the video is specifically:
applying the RNN model to the video feature sequence I_t = {I_t^1, ..., I_t^10} to generate the corresponding sentence set;
generating sentence features: all the words in each sentence S_i of the sentence set S are scanned in order, and each distinct word is kept once, forming a vocabulary represented as a word list; the cosine value between two sentence vectors S_i and S_j is used as the sentence similarity; a threshold Degree is set, and all edges whose similarity is less than Degree are deleted;
computing the LexRank score LR of each sentence node u_i, the initial score of each sentence node being d/N, where N is the number of sentence nodes and d is the damping factor, generally chosen between [0.1, 0.2], and computing the score LR according to the following formula:
LR(u) = d/N + (1 - d) · Σ_{v ∈ adj(u)} LR(v) / deg(v)
wherein, deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v; adj(u) is the set of nodes adjacent to u.
2. The video description method based on deep learning and text summarization according to claim 1, characterized in that downloading videos from the Internet, describing each video, and forming <video, description> pairs to constitute the text-description training set is specifically:
forming <video, description> pairs from the existing video collection and the sentence description corresponding to each video, constituting the text-description training set.
CN201510697454.3A 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Publications (2)

Publication Number Publication Date
CN105279495A CN105279495A (en) 2016-01-27
CN105279495B true CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A video description method based on deep learning and text summarization

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108780464A (en) 2016-03-31 2018-11-09 马鲁巴公司 Method and system for handling input inquiry
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition methods and device based on two-way LSTM neural network
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 A kind of determination method and device of video and the Video Key word degree of correlation
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Generated according to video data and explain data, data synthesis method and device, electronic equipment
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Generation method and device of natural language sentence library
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 A kind of generation method from structured text to iamge description
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Egg embryo classification based on deep learning
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 A kind of video presentation generation method based on deep learning and probability graph model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 A kind of data automatic statistical analysis method
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization; Gunes Erkan; Journal of Artificial Intelligence Research; 2004-12-04; Vol. 22, No. 1; pp. 457-467
Translating Videos to Natural Language Using Deep Recurrent Neural Networks; Subhashini Venugopalan et al.; Computer Science; 2014-12-19; pp. 3-6

Also Published As

Publication number Publication date
CN105279495A (en) 2016-01-27

Similar Documents

Publication Publication Date Title
Li et al. Learning query intent from regularized click graphs
Rosé et al. Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning
Stein et al. Intrinsic plagiarism analysis
Rohrbach et al. Grounding of textual phrases in images by reconstruction
Cheng et al. Neural summarization by extracting sentences and words
Xie et al. Representation learning of knowledge graphs with entity descriptions
Godin et al. Using topic models for twitter hashtag recommendation
Young et al. Affective news: The automated coding of sentiment in political texts
Cao et al. Deep neural networks for learning graph representations
Jiang et al. Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching.
Bruni et al. Multimodal distributional semantics
Iwata et al. Online multiscale dynamic topic models
Pang et al. Text matching as image recognition
Chen et al. A thorough examination of the cnn/daily mail reading comprehension task
US8510257B2 (en) Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
Hendricks et al. Generating visual explanations
Casamayor et al. Identification of non-functional requirements in textual specifications: A semi-supervised learning approach
Joty et al. Combining intra-and multi-sentential rhetorical parsing for document-level discourse analysis
Boyd-Graber et al. Syntactic topic models
US20120253792A1 (en) Sentiment Classification Based on Supervised Latent N-Gram Analysis
Karpathy et al. Deep visual-semantic alignments for generating image descriptions
US20150310862A1 (en) Deep learning for semantic parsing including semantic utterance classification
Yu et al. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering
Neculoiu et al. Learning text similarity with siamese recurrent networks
EP3166049A1 (en) Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
GR01 Patent grant