CN105279495B - A kind of video presentation method summarized based on deep learning and text - Google Patents

A kind of video presentation method summarized based on deep learning and text

Info

Publication number
CN105279495B
Authority
CN
China
Prior art keywords
video
description
sentence
neural network
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495A (en)
Inventor
李广
马书博
韩亚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wellthinker Automation Technology Co ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video description method based on deep learning and text summarization, comprising: training a convolutional neural network model on an existing image data set according to an image classification task; extracting a video frame sequence from the video and extracting convolutional neural network features with the trained model; forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model and training to obtain the recurrent neural network model; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain a description sequence; and ranking the description sequence with a method that uses graph-based lexical centrality as salience for text summarization, outputting the final description result of the video. Events occurring in a video segment and the object attributes related to those events are described in natural language, so as to describe and summarize the video content.

Description

Video description method based on deep learning and text summarization
Technical Field
The invention relates to the field of video description, in particular to a video description method based on deep learning and text summarization.
Background
Describing a video in natural language is extremely important both for understanding the video and for retrieving it on the Web. Meanwhile, the language description of video is the subject of intensive research in the fields of multimedia and computer vision. Video description means that, for a given video, video features are obtained by observing the contents of the video, and corresponding sentences are generated according to those contents. When people watch a video, especially a video of some action category, they understand it to a certain extent and can state in language what happens in it. For example, the video can be described with the sentence "a person is riding a motorcycle". However, when the number of videos is large, describing them one by one manually requires a great deal of time, labor and money. It is therefore necessary to analyze video features with computer technology and, combined with natural language processing methods, generate descriptions of the video. On one hand, through such a video description method, people can understand a video more accurately from a semantic perspective. On the other hand, in the field of video retrieval, retrieving the corresponding video from a text description entered by the user is a very difficult and challenging problem.
Various video description methods have emerged over the past few years. For example, by analyzing video features, the objects present in the video and the action relationships among them can be identified. A fixed language template of the form subject + verb + object is then adopted: the subject and object are determined from the recognized objects, the action relationship between them serves as the predicate, and the sentence describing the video is generated in this way.
However, such methods have certain limitations. Generating sentences with language templates easily leads to fixed sentence patterns; the sentences are too uniform and lack the expressiveness of natural human language. Meanwhile, different features are needed to identify objects, actions and so on in the video, so the steps are relatively cumbersome and a large amount of time is needed to train on the video features. Moreover, the recognition accuracy directly affects the quality of the generated sentences, and such a step-by-step method must guarantee high correctness at every step, which is difficult to achieve.
Disclosure of Invention
The invention provides a video description method based on deep learning and text summarization, which describes, in natural language, the events occurring in a video segment and the object attributes related to those events, thereby describing and summarizing the video content. The details are described as follows:
a video description method based on deep learning and text summarization is characterized by comprising the following steps:
downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
training a convolutional neural network model according to an image classification task through an existing image data set;
extracting a video frame sequence from the video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
describing a video frame sequence of a video to be described through a trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence with a method that takes the graph-based lexical centrality as the salience of the text summary, and outputting the final description result of the video.
The downloading of videos from the internet and the description of each video form a < video, description > pair, and the forming of a text description training set specifically comprises:
forming <video, description> pairs from the existing video set and the sentence descriptions corresponding to each video, to form the text description training set.
The steps of extracting a video frame sequence from the video, extracting convolutional neural network features with a convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model specifically comprise:
extracting the convolutional neural network characteristics of the image and sentence descriptions corresponding to the image for modeling by using the parameters after the convolutional neural network model is trained to obtain a target function;
constructing a recurrent neural network; modeling the nonlinear function with a long short-term memory (LSTM) network;
and optimizing the objective function by gradient descent, obtaining the trained long short-term memory network parameters.
The step of describing the video frame sequence of the video to be described by the trained recurrent neural network model to obtain the description sequence specifically comprises the following steps:
extracting the convolutional neural network characteristics of each image by using the trained model parameters and the convolutional neural network model to obtain image characteristics;
and taking the image characteristics as input and obtaining sentence description by using the model parameters obtained by training so as to obtain the sentence description corresponding to the video.
The technical scheme provided by the invention has the following beneficial effects: each video is composed of a frame sequence, and the low-level features of each frame are extracted with a convolutional neural network; extracting video features by deep learning in this way effectively avoids the excessive noise introduced by traditional feature extraction methods, noise that would reduce the accuracy of the sentences generated later. Each frame picture is converted into a sentence with the trained recurrent neural network, thereby generating a set of sentences. The automatic text summarization method then screens high-quality, representative sentences out of this set, by computing the centrality between sentences, to serve as the description of the video; this yields better description accuracy and sentence diversity. Meanwhile, the method based on deep learning and text summarization can be effectively extended to video retrieval applications, although it is at present limited to English descriptions of video content.
Drawings
FIG. 1 is a flow chart of a video description method based on deep learning and text summarization;
FIG. 2 is a schematic diagram of a convolutional neural network model (CNN) used in the present invention;
wherein Cov represents a convolution kernel; ReLU is the rectified linear unit max(0, x); Pool represents the pooling operation; LRN is the local response normalization operation; Softmax is the objective function.
FIG. 3 is a schematic diagram of a recurrent neural network used in the present invention;
wherein x_t represents the input in state t; h_(t-1) represents the hidden state of the previous state; i is the input gate; f is the forget gate; o is the output gate; c is the cell; m_t is the output after passing through an LSTM unit.
FIG. 4(a) is the initial fully connected graph of LexRank;
wherein S = {S_1, …, S_10} are 10 sentences generated by the recurrent neural network (RNN), represented as 10 nodes of a graph; the similarity between nodes is represented by edges, forming a fully connected graph, and the thickness of an edge represents the magnitude of the similarity.
FIG. 4(b) is the pruned LexRank graph;
wherein, by setting a threshold, the edges with small similarity between nodes are removed, so that the remaining edges connect nodes whose sentences are highly similar.
Fig. 5 is a schematic diagram of a sentence generated after a part of a video frame is described.
wherein each frame image is followed by the sentence generated with the CNN-RNN model used in the invention, and the sentence pointed to by the arrow is the summary of the video text descriptions obtained with the LexRank method, used as the text description of the video.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
In view of the problems described in the background art, and inspired by the remarkable improvement that deep learning methods have brought to image description, applying deep learning to video can improve the diversity and correctness of the generated video descriptions to a certain extent.
The embodiment of the invention provides a video description method based on deep learning and text summarization. Visual features are first extracted from the frames of each video with a convolutional neural network. Each frame feature is then taken as input to a recurrent neural network framework, with which a sentence description can be generated for each visual feature, i.e. for each frame of the video. A sentence set is thus obtained; in order to obtain the most expressive, high-quality sentences as the description of the video, the method adopts text summarization and ranks all sentences by computing the similarity between them, so that wrong or low-quality sentences are avoided as the final description of the video. With this automatic text summarization, a representative sentence with a certain correctness and reliability is obtained, improving the accuracy of the video description. Meanwhile, the method also overcomes some of the technical difficulties faced by video retrieval.
Example 1
A video description method based on deep learning and text summarization, referring to fig. 1, the method comprising the steps of:
101: downloading videos from the Internet, describing each video (English description) to form a < video, description > pair, and forming a text description training set, wherein each video corresponds to a plurality of sentence descriptions, so that a text description sequence is formed;
102: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set;
for example: ImageNet.
103: extracting a video frame sequence from a video, extracting CNN characteristics by using a Convolutional Neural Network (CNN) model, forming a < video frame sequence, a text description sequence > as an input of a Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
104: describing a video frame sequence of a video to be described by using the RNN model obtained by training to obtain a description sequence;
105: the reasonableness of the description sequence is ranked by using a method based on the lexical centrality of the graph as the significance (LexRank) of the text summary, and the most reasonable description is selected as the final description of the video.
In summary, the embodiments of the present invention implement, through steps 101 to 105, to describe an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 2
201: downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
the method specifically comprises the following steps:
(1) The Microsoft Research Video Description Corpus is downloaded from the Internet. This data set comprises 1970 video segments collected from YouTube and can be represented as VID = {vid_1, …, vid_Nd}, where Nd is the total number of videos in the set VID.
(2) Each video has a plurality of corresponding descriptions, written as Sentences = {Sentence_1, …, Sentence_N}, where N is the number of descriptions (Sentence_1, …, Sentence_N) corresponding to each video.
(3) The < video, description > pair is formed by the existing video set VID and the sentence description sequences corresponding to each video, and a text description training set is formed.
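For illustration only, a minimal Python sketch of this pairing step is given below; the video identifiers and sentences are placeholders, not data from the actual corpus.

```python
# Sketch of step 201: pair each video of the set VID with its sentence
# descriptions to form the <video, description> training set.
# Identifiers and sentences are illustrative placeholders.
video_ids = ["vid_0001", "vid_0002"]            # the set VID, |VID| = Nd
captions = {
    "vid_0001": ["a person is riding a motorcycle",
                 "a man rides a motorbike down the road"],
    "vid_0002": ["a dog is barking at the camera"],
}

# <video, description> pairs: each video keeps its full sentence sequence
training_set = [(vid, captions[vid]) for vid in video_ids]
for vid, sentences in training_set:
    print(vid, len(sentences), "descriptions")
```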
202: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set, and training CNN model parameters;
the method specifically comprises the following steps:
(1) constructing AlexNet [1] CNN model shown in FIG. 2: the model comprises 8 network layers, wherein the first 5 layers are convolutional layers, and the last 3 layers are full-connection layers.
(2) Using ImageNet as the training set, each picture in the image data set is sampled to a 256 × 256 picture and taken as input, with Nm the number of pictures. According to the network layers set in FIG. 2, layer 1 can be expressed as:
F1(IMAGE)=norm{pool[max(0,W1*IMAGE+B1)]} (1)
wherein IMAGE represents an input image; W1 represents the convolution kernel parameters; B1 represents a bias; F1(IMAGE) is the output after passing through the first network layer; norm denotes the normalization operation. In this network layer, the convolved image is processed by the linear rectification function max(0, x), with x = W1*IMAGE + B1, a pooling operation is performed, and the pooled result is normalized with local response normalization (LRN), in the following way:
b^i_(x,y) = a^i_(x,y) / ( k + α · Σ_j (a^j_(x,y))^2 )^β,  j = max(0, i − n/2), …, min(M − 1, i + n/2)   (2)
wherein M is the number of feature maps after pooling; i indexes the i-th of the M feature maps; n is the size of the local normalization, i.e. normalization is carried out over every n feature maps; a^i_(x,y) is the value at coordinate (x, y) in the i-th feature map; k is the offset; α and β are the normalization parameters; and b^i_(x,y) is the output after local response normalization (LRN).
In AlexNet, k = 2, n = 5, α = 10^-4, and β = 0.75.
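For illustration, a small NumPy sketch of the local response normalization of equation (2) with the AlexNet constants above; the array shape and values are toy placeholders.

```python
import numpy as np

def local_response_norm(a, k=2, n=5, alpha=1e-4, beta=0.75):
    # a: array of shape (M, H, W), the M pooled feature maps; returns b of eq. (2)
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        lo, hi = max(0, i - n // 2), min(M - 1, i + n // 2)
        b[i] = a[i] / (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
    return b

maps = np.random.rand(8, 4, 4)                  # toy "feature maps"
print(local_response_norm(maps).shape)          # (8, 4, 4)
```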
Continuing with the model, F1(IMAGE) is taken as the input of the second network layer, which, according to the settings of the second layer, can be expressed as:
F2(IMAGE)=max(0,W2*F1(IMAGE)+B2) (3)
wherein W2 represents the convolution kernel parameters; B2 represents a bias; F2(IMAGE) is the output after the second network layer. The second layer is arranged in the same way as the first, only the sizes of the convolution and pooling kernels change.
According to the network setup of AlexNet, the remaining convolutional layers can be represented in turn as:
F3(IMAGE)=max(0,W3*F2(IMAGE)+B3) (4)
F4(IMAGE)=max(0,W4*F3(IMAGE)+B4) (5)
F5(IMAGE)=pool[max(0,W5*F4(IMAGE)+B5)] (6)
wherein W3, W4, W5 and B3, B4, B5 are the convolution kernel parameters and biases of the respective layers.
The last 3 layers are fully connected layers, and the network layer settings according to fig. 2 can be expressed in turn as:
F6(IMAGE)=fc[F5(IMAGE),θ1] (7)
F7(IMAGE)=fc[F6(IMAGE),θ2] (8)
F8(IMAGE)=fc[F7(IMAGE),θ3] (9)
wherein fc represents a fully connected layer and θ1, θ2, θ3 are the parameters of the three fully connected layers; the feature F8(IMAGE) of the last layer is input to a 1000-class multivariate classifier for classification.
(3) According to the current network, a multivariate (softmax) classifier is set with objective function l(Θ), wherein m is the number of image categories in ImageNet, x(t) are the CNN features extracted for each class through the AlexNet network, y(t) is the label corresponding to each image, and Θ = {Wp, Bp, θq}, p = 1, …, 5, q = 1, 2, 3, are the parameters of each network layer. The parameters of the objective function are optimized by gradient descent, thereby obtaining the parameters Θ of the AlexNet network.
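The eight-layer network of step 202 can be sketched as follows. PyTorch is assumed as the implementation library; the channel widths (96/256/384/384/256) and fully connected sizes (4096) are the standard AlexNet values, since the patent fixes only the kernel sizes, strides, and LRN constants.

```python
import torch
import torch.nn as nn

class AlexNetCNN(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),      # layer 1, eq. (1)
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),    # layer 2, eq. (3)
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (4)
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (5)
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # eq. (6)
            nn.MaxPool2d(3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),      # fc6, eq. (7)
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),             # fc7, eq. (8)
            nn.Linear(4096, num_classes),                                           # fc8, eq. (9)
        )

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

model = AlexNetCNN()
logits = model(torch.randn(2, 3, 256, 256))                   # two 256x256 crops
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 2]))    # 1000-class softmax objective, step (3)
loss.backward()                                               # parameters Θ optimized by gradient descent
```

In step 203 below, an intermediate fully connected activation of such a network would typically serve as the per-image CNN feature I.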
203: extracting a video frame sequence from a video, extracting CNN characteristics by using a Convolutional Neural Network (CNN) model, forming a < video frame sequence, a text description sequence > as an input of a Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
the method comprises the following steps:
(1) According to step 202, using the parameters obtained after training the CNN model, the CNN feature I of an image and the sentence description S corresponding to the image are extracted for modeling, with the following objective function:
θ*=argmax∑logp(S|I;θ) (11)
wherein (S, I) represents an image-text pair in the training data; θ is the model parameter to be optimized; θ* is the optimized parameter.
the training aims to maximize the sum of the logarithmic probabilities of the sentences generated by all samples under observation of a given input image I, and the conditional probability chain rule is used to calculate the probability p (SI; theta), where the expression is:
wherein S_0, S_1, …, S_(t−1), S_t represent the words in a sentence. The unknown quantity p(S_t | I, S_0, S_1, …, S_(t−1)) in the formula is modeled with a recurrent neural network.
(2) Constructing a Recurrent Neural Network (RNN):
The first t−1 words are taken as the condition and represented as a fixed-length hidden state h_t; when a new input x_t appears, the hidden state is updated through a nonlinear function f, whose expression is:
h_(t+1)=f(h_t,x_t) (13)
wherein h_(t+1) denotes the next hidden state.
(3) The nonlinear function f is modeled by constructing a long short-term memory network (LSTM), as shown in FIG. 3;
wherein i_t is the input gate, f_t the forget gate, o_t the output gate, and c the cell; the update and output of each state can be expressed as:
i_t = σ(W_ix·x_t + W_im·m_(t−1))   (14)
f_t = σ(W_fx·x_t + W_fm·m_(t−1))   (15)
o_t = σ(W_ox·x_t + W_om·m_(t−1))   (16)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ h(W_cx·x_t + W_cm·m_(t−1))   (17)
m_t = o_t ⊙ c_t   (18)
p_(t+1) = Softmax(m_t)   (19)
wherein ⊙ denotes the product between gate values; the matrices W = {W_ix; W_im; W_fx; W_fm; W_ox; W_om; W_cx; W_cm} are the parameters to be trained; σ(·) is the sigmoid function (e.g. σ(W_ix·x_t + W_im·m_(t−1)) and σ(W_fx·x_t + W_fm·m_(t−1))); h(·) is the hyperbolic tangent function (e.g. h(W_cx·x_t + W_cm·m_(t−1))); p_(t+1) is the probability distribution of the next word after the Softmax classification; m_t is the current state feature.
(4) The objective function (11) is optimized by gradient descent, obtaining the trained long short-term memory (LSTM) network parameters W.
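A NumPy sketch of one LSTM update following equations (14) to (19) is given below; the toy dimension, the random parameters, and the direct Softmax over m_t (a practical decoder would normally project m_t to the vocabulary size first) are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)                        # input gate,  eq. (14)
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)                        # forget gate, eq. (15)
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)                        # output gate, eq. (16)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)   # cell update, eq. (17)
    m_t = o_t * c_t                                                        # output,      eq. (18)
    return m_t, c_t, softmax(m_t)                                          # p_(t+1),     eq. (19)

d = 16                                        # toy size (hidden state = vocabulary here)
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ("ix", "im", "fx", "fm", "ox", "om", "cx", "cm")}
m, c = np.zeros(d), np.zeros(d)
m, c, p_next = lstm_step(rng.normal(size=d), m, c, W)   # first input: the CNN image feature
print(round(float(p_next.sum()), 6))                    # a probability distribution over the next word
```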
204: describing a video frame sequence of a video to be described by using the RNN model obtained by training to obtain a description sequence, wherein the predicting step comprises the following steps:
(1) The test set is extracted, where Nt is the number of videos in the test set and video_t denotes a video under test; 10 frames of images are extracted from each video, which can be expressed as Image_t = {Image_t^1, …, Image_t^10}.
(2) Using the trained model parameters Θ = {Wi, Bi, θj}, i = 1, …, 5, j = 1, 2, 3, and the CNN model, the CNN feature of each image in Image_t is extracted, obtaining the image features I_t = {I_t^1, …, I_t^10}.
(3) The image features I_t are taken as input and, using the model parameters W obtained by training together with formula (12), the sentence description S = {S_1, …, S_n} is obtained, thereby obtaining the sentence descriptions corresponding to the video.
205: and sorting the reasonableness of the description sequence by using a LexRank method, and selecting the most reasonable description as a final description of the video.
(1) The RNN model is tested on the video feature sequence I_t = {I_t^1, …, I_t^10} to generate the corresponding sentence set S = {S_1, …, S_i, …, S_n}.
(2) Generating sentence features: each sentence S_i in the sentence set S is scanned in turn, with one entry for each distinct word, forming a vocabulary represented as the word list VOL = {w_1, …, w_Nw}, where Nw is the total number of words in the vocabulary VOL. For each word w_i in VOL, each sentence S_j in the set S is scanned in turn, counting the number of occurrences n_ij of word w_i in sentence S_j, where j = 1, …, Ns and Ns is the total number of sentences, and counting the number num(w_i) of sentences in S that contain word w_i. The word frequency tf(w_i, s_j) of each word w_i in each sentence S_j is calculated according to equation (20):
tf(w_i, s_j) = n_ij / Σ_k n_kj   (20)
wherein n_kj is the number of occurrences of the k-th word in the j-th sentence.
For each word w_i in the vocabulary VOL, the inverse document frequency idf(w_i) is calculated according to formula (21):
idf(wi)=log(Nd/num(wi)) (21)
wherein Nd is here the total number of sentences (documents) and num(w_i) is the number of sentences containing word w_i.
According to the vector space model, each sentence S_j in the set S is expressed as an Nw-dimensional vector, whose i-th dimension corresponds to word w_i in the vocabulary and takes the value tfidf(w_i), calculated as:
tfidf(wi)=tf(wi,sj)×idf(wi) (22)
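A short sketch of equations (20) to (22) on a toy sentence set follows; the sentences are placeholders standing in for RNN-generated captions, and the idf denominator counts the sentences that contain the word, as in standard LexRank.

```python
import math
from collections import Counter

S = ["a man is riding a motorcycle",
     "a person rides a bike",
     "a man is riding a bike on a road"]            # toy caption set
tok = [s.split() for s in S]
vocab = sorted({w for sent in tok for w in sent})   # word list VOL
Ns = len(S)

def tf(w, sent):                                    # eq. (20): n_ij / sum_k n_kj
    counts = Counter(sent)
    return counts[w] / sum(counts.values())

def idf(w):                                         # eq. (21), with sentences as documents
    return math.log(Ns / sum(1 for sent in tok if w in sent))

# eq. (22): each sentence as an Nw-dimensional tf-idf vector
vectors = [[tf(w, sent) * idf(w) for w in vocab] for sent in tok]
print(len(vocab), [round(x, 3) for x in vectors[0][:5]])
```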
(3) The cosine value between the two sentence vectors S_i and S_j is used as the sentence similarity, calculated as:
sim(S_i, S_j) = [ Σ_(w ∈ S_i, S_j) tf(w, S_i) · tf(w, S_j) · (idf_w)^2 ] / [ sqrt( Σ_(s_m ∈ S_i) (tf(s_m, S_i) · idf_(s_m))^2 ) · sqrt( Σ_(s_n ∈ S_j) (tf(s_n, S_j) · idf_(s_n))^2 ) ]   (23)
wherein tf(w, S_i) is the word frequency of word w in sentence S_i; tf(w, S_j) is the word frequency of word w in sentence S_j; idf_w is the inverse document frequency of word w; s_m is any word in sentence S_i; tf(s_m, S_i) is the word frequency of word s_m in S_i; idf_(s_m) is the inverse document frequency of word s_m; s_n is any word in sentence S_j; tf(s_n, S_j) is the word frequency of word s_n in S_j; idf_(s_n) is the inverse document frequency of word s_n.
A fully connected undirected graph is then formed, as in FIG. 4(a), with each node u_i corresponding to a sentence S_i and the edges between nodes weighted by the sentence similarity.
(4) A threshold value Degree is set, and all edges with similarity less than Degree are deleted, as shown in fig. 4 (b).
(5) The LexRank score LR of each sentence node u_i is calculated. The initial score of each sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, usually selected in [0.1, 0.2]. The score LR is calculated according to equation (24):
LR(u) = d/N + (1 − d) · Σ_(v ∈ adj(u)) LR(v)/deg(v)   (24)
wherein adj(u) is the set of nodes adjacent to node u; deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v.
(6) And calculating the LR score of each sentence node, sequencing, and selecting the sentence with the highest score as the final description of the video.
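Steps (3) to (6) can be sketched end to end as below; the captions, the threshold value, the damping factor, and the use of raw word counts as tf inside the cosine of equation (23) are illustrative assumptions only.

```python
import math
from collections import Counter

S = ["a man is riding a motorcycle",
     "a person rides a motorcycle",
     "a man is riding a bike on a road",
     "a dog is barking"]                            # toy RNN-generated captions
tok = [s.split() for s in S]
N = len(S)
idf = {w: math.log(N / sum(1 for t in tok if w in t))
       for w in {w for t in tok for w in t}}

def idf_cosine(a, b):                               # idf-modified cosine, eq. (23)
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[w] * cb[w] * idf[w] ** 2 for w in set(a) & set(b))
    da = math.sqrt(sum((ca[w] * idf[w]) ** 2 for w in ca))
    db = math.sqrt(sum((cb[w] * idf[w]) ** 2 for w in cb))
    return num / (da * db) if da and db else 0.0

Degree = 0.1                                        # similarity threshold, step (4)
adj = {u: [v for v in range(N) if v != u and idf_cosine(tok[u], tok[v]) >= Degree]
       for u in range(N)}

d = 0.15                                            # damping factor chosen in [0.1, 0.2]
LR = [d / N] * N                                    # initial score of each node, step (5)
for _ in range(50):                                 # iterate eq. (24) to convergence
    LR = [d / N + (1 - d) * sum(LR[v] / max(len(adj[v]), 1) for v in adj[u])
          for u in range(N)]

print(S[max(range(N), key=lambda u: LR[u])])        # highest-scoring sentence, step (6)
```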
In summary, the embodiment of the present invention, through steps 201 to 205, describes an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 3
Two videos are selected as videos to be described, and as shown in fig. 5, the videos are predicted by using the method based on deep learning and text summarization in the present invention to output corresponding video descriptions:
(1) Using ImageNet as the training set, each picture in the data set is sampled to a 256 × 256 picture and taken as input, with Nm the number of pictures.
(2) Building the first convolution layer: the size of convolution kernel cov1 is set to 11 and the stride to 4; ReLU is selected as max(0, x); a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2; and the convolved data are normalized with local response normalization, where, as in AlexNet, k = 2, n = 5, α = 10^-4, β = 0.75.
(3) Building the second convolution layer: the size of convolution kernel cov2 is set to 5 and the stride to 1; ReLU is selected as max(0, x); a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2; and the convolved data are normalized with local response normalization.
(4) And building a third layer of convolution layer, setting the size of a convolution kernel cov3 to be 3, the step length stride to be 1, and selecting ReLU to be max (0, x).
(5) And building a fourth layer of convolution layer, setting the size of a convolution kernel cov4 to be 3, the step length stride to be 1, and selecting ReLU to be max (0, x).
(6) Building the fifth convolution layer: the size of convolution kernel cov5 is set to 3 and the stride to 1; ReLU is selected as max(0, x); and a pooling operation is performed on the convolved feature map with kernel size 3 and stride 2.
(7) And building a sixth full connection layer, setting the layer as fc6, selecting ReLU as max (0, x), and performing dropout on the processed data.
(8) And building a seventh full connection layer, setting the layer as fc7, selecting ReLU as max (0, x), and performing dropout on the processed data.
(9) And building an eighth fully-connected layer, setting the layer to fc8, and adding a Softmax classifier as an objective function.
(10) And establishing a Convolutional Neural Network (CNN) model by setting the eight-layer network layers.
(11) And training CNN model parameters.
(12) Data processing: each video in the data set is uniformly sampled to 10 frames, each resized to 256 × 256, and input into the trained CNN model to obtain the image features; each frame image is randomly paired with 5 sentences of the video's text descriptions to serve as image-text pairs (a sketch of this frame sampling is given after step (13)).
(13) A Recurrent Neural Network (RNN) model is constructed.
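A sketch of the frame sampling in step (12) is given below, assuming OpenCV (cv2) as the video reader; the file name is a placeholder.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=10, size=(256, 256)):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))      # jump to the sampled frame
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))      # 256 x 256, as in step (12)
    cap.release()
    return frames

frames = sample_frames("video_0001.avi")                # placeholder path
print(len(frames))
```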
Fig. 5 shows the video text description results generated by the invention. The images in the figure are video frames extracted from a video, and the sentence beside each frame image is the result obtained after the video feature passes through the language model. The sentence at the bottom of the figure is the description obtained after the frame-level sentences are summarized, and serves as the text description of the video.
In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through the convolutional neural network and the recurrent neural network, and selects high-quality, representative sentences from the many sentences through a text summarization method. With this method the user can obtain a description of the video with high accuracy, and the method can be extended to video retrieval.
Reference to the literature
[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks[C]. Advances in Neural Information Processing Systems, 2012.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (2)

1. A video description method based on deep learning and text summarization is characterized by comprising the following steps:
1) downloading videos from the Internet, describing each video to form a pair of < videos, description > and form a text description training set;
2) training a convolutional neural network model according to an image classification task through an existing image data set;
3) extracting a video frame sequence from the video, extracting convolutional neural network features with a convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model;
extracting the convolutional neural network characteristics of the image and sentence descriptions corresponding to the image for modeling by using the parameters after training the convolutional neural network model to obtain a target function;
constructing a recurrent neural network; modeling the nonlinear function with a long short-term memory (LSTM) network;
optimizing the objective function by gradient descent, and obtaining the trained long short-term memory network parameters;
4) describing a video frame sequence of a video to be described through a trained recurrent neural network model to obtain a description sequence;
extracting the convolutional neural network characteristics of each image by using the trained model parameters and the convolutional neural network model to obtain image characteristics;
the image characteristics are used as input, sentence description is obtained by using model parameters obtained by training, and accordingly sentence description corresponding to the video is obtained;
5) sequencing the description sequence by using a method based on the vocabulary centrality of the graph as the significance of the text summary, and outputting the final description result of the video;
the description sequence is sequenced, and the final description result of the output video specifically includes:
testing with the RNN model on the video feature sequence I_t = {I_t^1, …, I_t^10} to generate the corresponding sentence set;
generating sentence characteristics, and sequentially scanning each sentence S in all sentence setsiOne for each different word in all the words in (1), and forming a vocabulary table represented by a word list; using two vectors Si,SjThe cosine value between them is used as sentence similarity; setting a threshold value Degree, and deleting all edges with similarity smaller than Degree;
calculating the LexRank score LR of each sentence node u_i, the initial score of each sentence node being d/N, where N is the number of sentence nodes and d is the damping factor, usually selected in [0.1, 0.2]; the score LR is calculated according to the following formula:
LR(u) = d/N + (1 − d) · Σ_(v ∈ adj(u)) LR(v)/deg(v)
wherein adj(u) is the set of nodes adjacent to node u; deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v.
2. The video description method based on deep learning and text summarization as claimed in claim 1, wherein the downloading of videos from the internet and the description of each video form < video, description > pairs, and the forming of the text description training set specifically comprises:
and forming a < video, description > pair by the existing video set and the sentence description corresponding to each video to form a text description training set.
CN201510697454.3A 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text Active CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Publications (2)

Publication Number Publication Date
CN105279495A CN105279495A (en) 2016-01-27
CN105279495B true CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A Active CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
US9984772B2 (en) * 2016-04-07 2018-05-29 Siemens Healthcare Gmbh Image analytics question answering
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN107391505B (en) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 Image processing method and system
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition methods and device based on two-way LSTM neural network
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 A kind of determination method and device of video and the Video Key word degree of correlation
CN107707931B (en) * 2016-08-08 2021-09-10 阿里巴巴集团控股有限公司 Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN106372107B (en) * 2016-08-19 2020-01-17 中兴通讯股份有限公司 Method and device for generating natural language sentence library
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 A kind of generation method from structured text to iamge description
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Egg embryo classification based on deep learning
GB2555431A (en) * 2016-10-27 2018-05-02 Nokia Technologies Oy A method for analysing media content
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on depth LSTM network
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method of multi-cascade junction cyclic neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106845411B (en) * 2017-01-19 2020-06-30 清华大学 Video description generation method based on deep learning and probability map model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Method and device for generating graphic description
DE102017205713A1 (en) 2017-04-04 2018-10-04 Siemens Aktiengesellschaft Method and control device for controlling a technical system
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction technique and device, storage medium
US10872273B2 (en) * 2017-05-02 2020-12-22 Kodak Alaris Inc. System and method for batch-normalized recurrent highway networks
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 A kind of method and system for realizing image switch labels
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
CN107291882B (en) * 2017-06-19 2020-07-14 江苏赛睿信息科技股份有限公司 Automatic statistical analysis method for data
CN107515900B (en) * 2017-07-24 2020-10-30 宗晖(上海)机器人有限公司 Intelligent robot and event memo system and method thereof
CN107368887B (en) * 2017-07-25 2020-08-07 江西理工大学 Deep memory convolutional neural network device and construction method thereof
CN111133453B (en) 2017-08-04 2024-05-14 诺基亚技术有限公司 Artificial neural network
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 A kind of picture based on attribute probability vector guiding attention mode describes method
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN109522531B (en) * 2017-09-18 2023-04-07 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Method for classifying hyperspectral remote sensing images of guide filtering long and short memory neural network
CN107818306B (en) * 2017-10-31 2020-08-07 天津大学 Video question-answering method based on attention model
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 Dynamic multi-modal video description generation method
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
CN108307229B (en) * 2018-02-02 2023-12-22 新华智云科技有限公司 Video and audio data processing method and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN108765383B (en) * 2018-03-22 2022-03-18 山西大学 Video description method based on deep migration learning
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108683924B (en) * 2018-05-30 2021-12-28 北京奇艺世纪科技有限公司 Video processing method and device
CN108881950B (en) * 2018-05-30 2021-05-25 北京奇艺世纪科技有限公司 Video processing method and device
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109711022B (en) * 2018-12-17 2022-11-18 哈尔滨工程大学 Submarine anti-sinking system based on deep learning
CN109960747B (en) 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN110096707B (en) * 2019-04-29 2020-09-29 北京三快在线科技有限公司 Method, device and equipment for generating natural language and readable storage medium
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 Self-adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description
CN110659392B (en) * 2019-09-29 2022-05-06 北京市商汤科技开发有限公司 Retrieval method and device, and storage medium
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN110765921B (en) * 2019-10-18 2022-04-19 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on LSTM model from coarse to fine
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 Video annotation method based on deep learning
CN111404676B (en) * 2020-03-02 2023-08-29 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret key and ciphertext
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph rolling network
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111931690B (en) * 2020-08-28 2024-08-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN113191262B (en) * 2021-04-29 2022-08-19 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization; Gunes Erkan; Journal of Artificial Intelligence Research; 2004-12-04; Vol. 22, No. 1; pp. 457-467
Translating Videos to Natural Language Using Deep Recurrent Neural Networks; Subhashini Venugopalan et al.; Computer Science; 2014-12-19; pp. 3-6

Also Published As

Publication number Publication date
CN105279495A (en) 2016-01-27

Similar Documents

Publication Publication Date Title
CN105279495B (en) A kind of video presentation method summarized based on deep learning and text
CN106503055B (en) A kind of generation method from structured text to iamge description
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
US20190102655A1 (en) Training data acquisition method and device, server and storage medium
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN113641820A (en) Visual angle level text emotion classification method and system based on graph convolution neural network
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN109918539B (en) Audio and video mutual retrieval method based on user click behavior
CN108363790A (en) For the method, apparatus, equipment and storage medium to being assessed
WO2023035923A1 (en) Video checking method and apparatus and electronic device
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN111368075A (en) Article quality prediction method and device, electronic equipment and storage medium
CN110377778A (en) Figure sort method, device and electronic equipment based on title figure correlation
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN110717090A (en) Network public praise evaluation method and system for scenic spots and electronic equipment
CN111507089A (en) Document classification method and device based on deep learning model and computer equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115775349A (en) False news detection method and device based on multi-mode fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220322

Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong

Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Patentee before: Tianjin University