CN105279495A - Video description method based on deep learning and text summarization - Google Patents

Video description method based on deep learning and text summarization

Info

Publication number
CN105279495A
CN105279495A (application CN201510697454.3A)
Authority
CN
China
Prior art keywords
video
description
neural network
network model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510697454.3A
Other languages
Chinese (zh)
Other versions
CN105279495B (en)
Inventor
李广
马书博
韩亚洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wellthinker Automation Technology Co ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201510697454.3A priority Critical patent/CN105279495B/en
Publication of CN105279495A publication Critical patent/CN105279495A/en
Application granted granted Critical
Publication of CN105279495B publication Critical patent/CN105279495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video description method based on deep learning and text summarization. The method comprises the following steps: training a convolutional neural network model on an image classification task with a traditional image data set; extracting the video frame sequence of each video, extracting convolutional neural network features with the convolutional neural network model, and forming <video frame sequence, text description sequence> pairs that serve as the input for training a recurrent neural network model; describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain description sequences; and ranking the description sequences by the method of graph-based lexical centrality as salience in text summarization, then outputting the final description result of the video. The events occurring in a piece of video and the object attributes associated with those events are described in natural language, so as to describe and summarize the video content.

Description

Video description method based on deep learning and text summarization
Technical Field
The invention relates to the field of video description, in particular to a video description method based on deep learning and text summarization.
Background
Describing a video in natural language is extremely important both for understanding the video and for retrieving it on the Web, and the language description of video is accordingly a subject of intensive research in multimedia and computer vision. Video description means that, for a given video, video features are obtained by observing the contents of the video, and corresponding sentences are generated from those contents. When people watch a video, especially a video of some action category, they gain a certain understanding of it and can state in words what happens in it, for example describing the video with the sentence "a person is riding a motorcycle". With massive numbers of videos, however, describing them one by one manually requires a great deal of time, labor, and money, so it is necessary to analyze video features with computer technology and combine them with natural language processing methods to generate descriptions of the video. On the one hand, such a video description method lets people understand a video more accurately from a semantic perspective; on the other hand, in the field of video retrieval, letting a user retrieve the corresponding video from an input text description is very difficult and challenging.
Various video description methods have emerged over the past few years. For example, by analyzing video features, the objects present in the video and the action relationships among them are identified; then a fixed language template, subject + verb + object, is applied: the subject and object are determined from the recognized objects, the action relationship between them serves as the predicate, and a sentence describing the video is generated in this way.
However, such methods have certain limitations. Generating sentences from language templates easily leads to fixed, monotonous sentence patterns that lack the expressiveness of natural human language. Meanwhile, recognizing the objects, actions, and so on in a video requires different features, so the procedure is relatively complicated and training the video features takes a large amount of time. Moreover, the recognition accuracy directly affects the quality of the generated sentences, and such a step-by-step method must guarantee high correctness at every step, which is difficult to achieve.
Disclosure of Invention
The invention provides a video description method based on deep learning and text summarization, which describes in natural language the events occurring in a piece of video and the object attributes related to those events, thereby describing and summarizing the video content. The details are described as follows:
a video description method based on deep learning and text summarization, characterized by comprising the following steps:
downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
training a convolutional neural network model on an image classification task with an existing image data set;
extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence by the method of graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video.
The downloading of videos from the Internet and describing of each video to form <video, description> pairs, forming a text description training set, specifically comprises:
forming <video, description> pairs from the existing video set and the sentence descriptions corresponding to each video, so as to form the text description training set.
The steps of extracting a video frame sequence from each video, extracting the convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model specifically comprise:
extracting the convolutional neural network features of the images and the sentence descriptions corresponding to the images for modeling, using the parameters obtained from convolutional neural network model training, to obtain an objective function;
constructing a recurrent neural network, and modeling the nonlinear function through a long short-term memory network;
and optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.
The step of describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:
extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;
and taking the image features as input and using the model parameters obtained by training to generate the sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.
The technical scheme provided by the invention has the following beneficial effects. Each video is composed of a frame sequence, and the bottom-layer features of each frame are extracted with a convolutional neural network; extracting video features by deep learning in this way effectively avoids the excessive noise introduced by traditional feature extraction methods, which would reduce the accuracy of sentence generation at the later stage. Each frame picture is converted into a sentence with the trained recurrent neural network, producing a sentence set. The automatic text summarization method then screens high-quality, representative sentences out of the sentence set by computing the centrality between sentences, and these serve as the description of the video, yielding better description quality in terms of both accuracy and sentence diversity. Meanwhile, the method based on deep learning and text summarization can be effectively extended to video retrieval applications, although it is currently limited to English descriptions of video content.
Drawings
FIG. 1 is a flow chart of a video description method based on deep learning and text summarization;
FIG. 2 is a schematic diagram of a convolutional neural network model (CNN) used in the present invention;
wherein Cov represents a convolution kernel; ReLU is the rectified linear function max(0, x); Pool stands for the pooling operation; LRN is the local response normalization operation; Softmax is the objective function.
FIG. 3 is a schematic diagram of a recurrent neural network used in the present invention;
wherein t denotes the input at state t; h_{t-1} denotes the hidden state of the previous step; i is the input gate; f is the forget gate; o is the output gate; c is the memory cell; and m_t is the output after passing through an LSTM unit.
FIG. 4(a) is the initial fully connected graph of LexRank;
wherein S = {S_1, …, S_10}, the 10 sentences generated by the Recurrent Neural Network (RNN), are represented as 10 nodes of a graph; the similarity between nodes is represented by edges, forming a fully connected graph, and the thickness of an edge indicates the magnitude of the similarity.
FIG. 4(b) is the pruned graph of LexRank;
by setting a threshold, edges with small inter-node similarity are removed, so that the remaining edges connect only nodes, i.e. sentences, with high similarity.
FIG. 5 is a schematic diagram of the sentences generated for part of the video frames,
wherein each frame image is followed by the sentence generated with the CNN-RNN model used in the invention, and the sentence indicated by the arrow is the summary of the video text descriptions obtained with the LexRank method, serving as the text description of the video.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Based on the problems in the background art, and inspired by the remarkable improvement that deep learning methods have brought to image description, applying deep learning to video improves the diversity and correctness of the generated video descriptions to a certain extent.
The embodiment of the invention provides a video description method based on deep learning and text summarization. First, convolutional neural network features are extracted from the frame sequence of each video. Each video feature is then taken as input to a recurrent neural network framework, with which a sentence description can be generated for each visual feature, i.e. for each frame of the video. A sentence set is thus obtained; in order to obtain the most expressive, high-quality sentences as the description of the video, the method adopts text summarization and ranks all sentences by computing the similarity between them, so that wrong sentences and low-quality sentences are avoided as the final description of the video. With the automatic text summarization method, a representative sentence of a certain correctness and reliability is obtained, which improves the accuracy of the video description. Meanwhile, the method also alleviates some of the technical difficulties faced by video retrieval.
Example 1
A video description method based on deep learning and text summarization, referring to fig. 1, the method comprising the steps of:
101: downloading videos from the Internet, describing each video (in English) to form <video, description> pairs, and forming a text description training set, wherein each video corresponds to a plurality of sentence descriptions, so that a text description sequence is formed;
102: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set;
for example: ImageNet.
103: extracting a video frame sequence from each video, extracting CNN features with the Convolutional Neural Network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of a Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
104: describing a video frame sequence of a video to be described by using the RNN model obtained by training to obtain a description sequence;
105: the reasonableness of the description sequence is ranked by using a method based on the lexical centrality of the graph as the significance (LexRank) of the text summary, and the most reasonable description is selected as the final description of the video.
In summary, the embodiments of the present invention implement, through steps 101 to 105, to describe an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 2
201: downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
the method specifically comprises the following steps:
(1) The Microsoft Research Video Description Corpus, comprising 1970 video segments collected from YouTube, is downloaded from the Internet and represented as a data set VID = {Video_1, …, Video_{N_d}}, wherein N_d is the total number of videos in the set VID.
(2) Each video has a plurality of corresponding descriptions, denoted Sentences = {Sentence_1, …, Sentence_N}, wherein N represents the number of descriptions corresponding to each video.
(3) The <video, description> pairs are formed from the existing video set VID and the sentence description set Sentences corresponding to each video, forming the text description training set; a small sketch of this pairing follows.
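As a concrete illustration of this pairing, the following sketch builds the training set as <video, description> tuples; the file names and sentences are invented examples, not data from the corpus.

```python
videos = ["vid0001.avi", "vid0002.avi"]   # the set VID
captions = {                              # Sentences for each video
    "vid0001.avi": ["a person is riding a motorcycle",
                    "a man rides a motorbike"],
    "vid0002.avi": ["a dog is running"],
}
# the <video, description> pairs of the text description training set
training_set = [(v, s) for v in videos for s in captions[v]]
```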
202: training a Convolutional Neural Network (CNN) model according to an image classification task by using the existing image data set, and training CNN model parameters;
the method specifically comprises the following steps:
(1) The AlexNet [1] CNN model shown in FIG. 2 is constructed: the model comprises 8 network layers, of which the first 5 are convolutional layers and the last 3 are fully connected layers.
(2) Using ImageNet as the training set, each picture in the image data set is sampled to a 256 × 256 picture, and IMAGE = {Image_1, …, Image_{N_m}} is taken as input, wherein N_m is the number of pictures. According to the network layers set in FIG. 2, the first layer can be expressed as:
F_1(IMAGE) = norm{pool[max(0, W_1 * IMAGE + B_1)]}   (1)
wherein IMAGE represents the input image; W_1 represents the convolution kernel parameters; B_1 represents a bias; F_1(IMAGE) is the output after the first network layer; norm denotes the normalization operation. In this network layer, the convolved image is processed by the rectified linear function max(0, x), with x = W_1 * IMAGE + B_1, a pooling operation is performed, and local response normalization (LRN) is applied to the pooled maps in the following way:
b_i^{x,y} = a_i^{x,y} / ( k + α Σ_{j = max(0, i-n/2)}^{min(M-1, i+n/2)} (a_j^{x,y})^2 )^β   (2)
wherein M is the number of feature maps after pooling; i is the i-th of the M feature maps; n is the size of local normalization, i.e. normalization is carried out over every n feature maps; a_i^{x,y} represents the value at coordinate (x, y) in the i-th feature map; k is the offset; α and β are the normalization parameters; b_i^{x,y} is the output after local response normalization (LRN).
In AlexNet, k = 2, n = 5, α = 10^(-4), β = 0.75.
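The following NumPy sketch implements equation (2) directly; the (M, H, W) feature-map layout and the default constants are the AlexNet values quoted above, and the function name is illustrative.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Equation (2): normalize each map i over n neighbouring maps.

    a: array of shape (M, H, W), the M feature maps after pooling.
    """
    M = a.shape[0]
    b = np.empty_like(a)
    for i in range(M):
        lo = max(0, i - n // 2)            # j = max(0, i - n/2)
        hi = min(M - 1, i + n // 2)        # j = min(M - 1, i + n/2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom                # b_i^{x,y} = a_i^{x,y} / (...)^beta
    return b
```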
Continuing with the model, F_1(IMAGE) is taken as the input of the second network layer, which can be expressed as:
F_2(IMAGE) = max(0, W_2 * F_1(IMAGE) + B_2)   (3)
wherein W_2 represents the convolution kernel parameters; B_2 represents a bias; F_2(IMAGE) is the output after the second network layer. The first and second layers are set identically, except that the sizes of the convolution kernels and the pooling kernels differ.
According to the network setup of AlexNet, the remaining convolutional layers can be expressed in turn as:
F_3(IMAGE) = max(0, W_3 * F_2(IMAGE) + B_3)   (4)
F_4(IMAGE) = max(0, W_4 * F_3(IMAGE) + B_4)   (5)
F_5(IMAGE) = pool[max(0, W_5 * F_4(IMAGE) + B_5)]   (6)
wherein W_3, W_4, W_5 and B_3, B_4, B_5 are the convolution parameters and biases of each layer.
The last 3 layers are fully connected layers; according to the network layers set in FIG. 2, they can be expressed in turn as:
F_6(IMAGE) = fc[F_5(IMAGE), θ_1]   (7)
F_7(IMAGE) = fc[F_6(IMAGE), θ_2]   (8)
F_8(IMAGE) = fc[F_7(IMAGE), θ_3]   (9)
wherein fc represents a fully connected layer and θ_1, θ_2, θ_3 represent the parameters of the three fully connected layers; the last-layer feature F_8(IMAGE) is input to a 1000-class multivariate classifier for classification.
(3) According to the current network, a multivariate classifier is set up, which can be expressed as:
l(Θ) = Σ_{t=1}^{m} log p(y^{(t)} | x^{(t)}; Θ)   (10)
wherein l(Θ) is the objective function; m is the number of image classes in ImageNet; x^{(t)} are the CNN features extracted for each class by the AlexNet network; y^{(t)} is the label corresponding to each image; and Θ = {W_p, B_p, θ_q}, p = 1, …, 5, q = 1, 2, 3, are the parameters of each network layer. The objective function parameters are optimized by gradient descent, thereby obtaining the parameters Θ set by the AlexNet network; a sketch of this architecture and objective follows.
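As a concrete rendering of steps (1)-(3), the sketch below builds the eight-layer network and the training objective with PyTorch. The kernel sizes, strides, pooling and LRN constants follow the text above (and Example 3); the 96/256/384 channel counts and the 227 × 227 input crop are taken from the AlexNet paper [1] rather than stated in the patent, so treat them as assumptions.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        lrn = lambda: nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),             # layer 1, eq. (1)
            nn.MaxPool2d(kernel_size=3, stride=2), lrn(),
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(), # layer 2, eq. (3)
            nn.MaxPool2d(kernel_size=3, stride=2), lrn(),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),          # layer 3, eq. (4)
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),          # layer 4, eq. (5)
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),          # layer 5, eq. (6)
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),  # fc6, eq. (7)
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),         # fc7, eq. (8)
            nn.Linear(4096, num_classes),                           # fc8, eq. (9)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Equation (10): maximizing the log-likelihood l(Θ) equals minimizing cross-entropy.
model = AlexNetSketch()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 227, 227)      # dummy batch standing in for ImageNet crops
labels = torch.randint(0, 1000, (4,))
loss = criterion(model(images), labels)   # -l(Θ) of equation (10)
loss.backward()
optimizer.step()                          # one gradient-descent step on Θ
```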
203: extracting a video frame sequence from each video, extracting CNN features with the Convolutional Neural Network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of the Recurrent Neural Network (RNN) model, and training to obtain the Recurrent Neural Network (RNN) model;
the method comprises the following steps:
(1) According to step 202, using the parameters obtained from CNN model training, the CNN feature I of an image and the sentence description S corresponding to the image are extracted for modeling, with the objective function:
θ* = argmax_θ Σ_{(S,I)} log p(S | I; θ)   (11)
wherein (S, I) represents an image-text pair in the training data; θ is the parameter of the model to be optimized; θ* is the optimized parameter.
The goal of training is to maximize the sum of the log probabilities of the sentences generated for all samples given the observed input image I. The probability p(S | I; θ) is computed with the conditional probability chain rule:
log p(S | I) = Σ_{t=0}^{N} log p(S_t | I, S_0, S_1, …, S_{t-1})   (12)
wherein S_0, S_1, …, S_{t-1}, S_t represent the words in a sentence. For the unknown quantity p(S_t | I, S_0, S_1, …, S_{t-1}) in the formula, modeling is performed with a recurrent neural network.
(2) Constructing a Recurrent Neural Network (RNN):
The first t-1 words are taken as the condition and represented as a fixed-length hidden state h_t; when a new input x_t appears, the hidden state is updated through a nonlinear function f, expressed as:
h_{t+1} = f(h_t, x_t)   (13)
wherein h_{t+1} denotes the next hidden state.
(3) For the nonlinear function f, modeling is performed by constructing a long short-term memory (LSTM) network as shown in FIG. 3,
wherein i_t is the input gate, f_t is the forget gate, o_t is the output gate, and c is the memory cell. The update and output of each state can be expressed as:
i_t = σ(W_ix x_t + W_im m_{t-1})   (14)
f_t = σ(W_fx x_t + W_fm m_{t-1})   (15)
o_t = σ(W_ox x_t + W_om m_{t-1})   (16)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h(W_cx x_t + W_cm m_{t-1})   (17)
m_t = o_t ⊙ c_t   (18)
p_{t+1} = Softmax(m_t)   (19)
wherein ⊙ denotes the product with the gate values; the matrices W = {W_ix, W_im, W_fx, W_fm, W_ox, W_om, W_cx, W_cm} are the parameters to be trained; σ(·) is the sigmoid function (e.g. σ(W_ix x_t + W_im m_{t-1}) and σ(W_fx x_t + W_fm m_{t-1})); h(·) is the hyperbolic tangent function (e.g. h(W_cx x_t + W_cm m_{t-1})); p_{t+1} is the probability distribution of the next word after the Softmax classification; m_t is the current state feature. A sketch of this update follows.
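A minimal NumPy sketch of one LSTM update implementing equations (14)-(19). Bias terms are omitted because the patent's formulas omit them, and the output matrix W_out mapping m_t to vocabulary logits is an assumption of the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W, W_out):
    """One LSTM update; W holds the {ix, im, fx, fm, ox, om, cx, cm} matrices."""
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)   # (14) input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)   # (15) forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)   # (16) output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)  # (17)
    m_t = o_t * c_t                                   # (18) state feature
    p_next = softmax(W_out @ m_t)                     # (19) next-word distribution
    return m_t, c_t, p_next
```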
(4) The objective function (11) is optimized by gradient descent, yielding the trained long short-term memory (LSTM) network parameters W.
204: the video frame sequence of the video to be described is described with the trained RNN model to obtain the description sequence; the prediction comprises the following steps:
(1) The test set VID_t = {Video_t^1, …, Video_t^{N_t}} is extracted, wherein N_t is the number of test-set videos and t denotes the test set; 10 frames are extracted from each video, which can be expressed as Image_t = {Image_t^1, …, Image_t^{10}}.
(2) Using the trained model parameters Θ = {W_i, B_i, θ_j}, i = 1, …, 5, j = 1, 2, 3, the CNN features of each image in Image_t are extracted with the CNN model to obtain the image features I_t = {I_t^1, …, I_t^{10}}.
(3) Taking the image features I_t as input and using the trained model parameters W in equation (12), the sentence description S = {S_1, …, S_n} is obtained, thereby yielding the sentence descriptions corresponding to the video; a decoding sketch follows.
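A sketch of this decoding step, reusing lstm_step from the sketch above: the CNN image feature is fed as the first input and each predicted word is fed back in. The embedding matrix E, the end-of-sentence id and the greedy argmax choice are assumptions of the sketch, not prescribed by the patent.

```python
import numpy as np

def generate_sentence(image_feature, W, W_out, E, end_id, max_len=20):
    d = W["im"].shape[0]            # hidden size
    m, c = np.zeros(d), np.zeros(d)
    x = image_feature               # first input: the CNN feature I_t
    words = []
    for _ in range(max_len):
        m, c, p = lstm_step(x, m, c, W, W_out)
        w = int(np.argmax(p))       # greedy choice of the next word
        if w == end_id:
            break
        words.append(w)
        x = E[w]                    # feed the chosen word's embedding back in
    return words
```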
205: the description sequence is ranked by reasonableness with the LexRank method, and the most reasonable description is selected as the final description of the video.
(1) The video feature sequence I_t = {I_t^1, …, I_t^{10}} is fed to the trained RNN model to generate the corresponding sentence set S = {S_1, …, S_i, …, S_n}.
(2) Generating sentence features: each sentence S_i in the sentence set is scanned in turn, and each distinct word forms an entry of a vocabulary VOL = {w_1, …, w_{N_w}}, wherein N_w is the total number of words in the vocabulary VOL. For each word w_i in VOL, each sentence S_j in the sentence set S is scanned in turn, counting the number of occurrences n_{ij} of w_i in S_j, wherein j = 1, …, N_s and N_s is the total number of sentences, as well as num(w_i), the number of sentences in the set S that contain w_i. The word frequency tf(w_i, s_j) of each word w_i in each sentence S_j is calculated according to equation (20), wherein i = 1, …, N_w:
tf(w_i, s_j) = n_{ij} / Σ_{k=1}^{N_w} n_{kj}   (20)
wherein n_{kj} is the number of occurrences of the k-th word in the j-th sentence.
For each word w_i in the vocabulary VOL, the inverse document frequency idf(w_i) is calculated according to equation (21):
idf(w_i) = log(N_d / num(w_i))   (21)
wherein N_d is the total number of sentences (documents).
According to the vector space model, each sentence S_j in the set S is represented as an N_w-dimensional vector whose i-th dimension corresponds to the word w_i of the vocabulary and takes the value tfidf(w_i), calculated as:
tfidf(w_i) = tf(w_i, s_j) × idf(w_i)   (22)
A sketch of this computation follows.
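A compact sketch of equations (20)-(22), treating each generated sentence as a whitespace-tokenized document; the tokenization is an assumption of the sketch.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    docs = [s.lower().split() for s in sentences]
    num = Counter(w for d in docs for w in set(d))   # num(w_i): sentences containing w_i
    N = len(docs)
    vectors = []
    for d in docs:
        counts = Counter(d)                          # n_ij
        total = sum(counts.values())                 # sum over k of n_kj
        vectors.append({w: (counts[w] / total)       # tf,    eq. (20)
                           * math.log(N / num[w])    # idf,   eq. (21)
                        for w in counts})            # tfidf, eq. (22)
    return vectors
```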
(3) The cosine between two sentence vectors S_i and S_j is used as the sentence similarity, calculated as:
similarity(S_i, S_j) = Σ_{w ∈ S_i, S_j} tf_{w,S_i} · tf_{w,S_j} · (idf_w)^2 / ( sqrt(Σ_{s_m ∈ S_i} (tf_{s_m,S_i} · idf_{s_m})^2) × sqrt(Σ_{s_n ∈ S_j} (tf_{s_n,S_j} · idf_{s_n})^2) )   (23)
wherein tf_{w,S_i} is the word frequency of word w in sentence S_i; tf_{w,S_j} is the word frequency of word w in sentence S_j; idf_w is the inverse document frequency of word w; s_m is any word in sentence S_i, with word frequency tf_{s_m,S_i} in S_i and inverse document frequency idf_{s_m}; s_n is any word in sentence S_j, with word frequency tf_{s_n,S_j} in S_j and inverse document frequency idf_{s_n}.
A fully connected undirected graph is thereby formed, as in FIG. 4(a), with each node u_i representing a sentence S_i and the edges between nodes weighted by the sentence similarity.
(4) A threshold Degree is set, and all edges with similarity less than Degree are deleted, as shown in FIG. 4(b).
(5) The LexRank score LR of each sentence node u_i is calculated. The initial score of each sentence node is d/N, where N is the number of sentence nodes and d is the damping factor, usually chosen in [0.1, 0.2]. The score LR is calculated according to equation (24):
LR(u) = (1 - d) Σ_{v ∈ adj[u]} LR(v) / deg(v) + d/N   (24)
wherein adj[u] is the set of nodes adjacent to node u; deg(v) is the degree of node v; LR(u) is the score of node u; LR(v) is the score of node v.
(6) The LR scores of all sentence nodes are calculated and ranked, and the sentence with the highest score is selected as the final description of the video; a sketch of this LexRank step follows.
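Putting steps (3)-(6) together, the sketch below computes the idf-modified cosine of equation (23), prunes edges below the threshold Degree as in step (4), and iterates the scoring of equation (24); the threshold, damping factor and iteration count are illustrative values.

```python
import math
from collections import Counter

def idf_cosine(d1, d2, idf):
    """Equation (23): idf-modified cosine between two term-frequency Counters."""
    num = sum(d1[w] * d2[w] * idf[w] ** 2 for w in set(d1) & set(d2))
    den1 = math.sqrt(sum((d1[w] * idf[w]) ** 2 for w in d1))
    den2 = math.sqrt(sum((d2[w] * idf[w]) ** 2 for w in d2))
    return num / (den1 * den2) if den1 and den2 else 0.0

def lexrank(sentences, degree=0.1, d=0.15, iters=50):
    docs = [Counter(s.lower().split()) for s in sentences]  # term frequencies
    N = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(N / df[w]) for w in df}
    # step (4): keep only edges whose similarity reaches the threshold Degree
    adj = [[j for j in range(N)
            if j != i and idf_cosine(docs[i], docs[j], idf) >= degree]
           for i in range(N)]
    lr = [d / N] * N                                        # initial scores d/N
    for _ in range(iters):                                  # equation (24)
        lr = [(1 - d) * sum(lr[v] / max(len(adj[v]), 1) for v in adj[u]) + d / N
              for u in range(N)]
    # step (6): the highest-scoring sentence is the final video description
    return max(range(N), key=lambda u: lr[u])
```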
In summary, the embodiment of the present invention, through steps 201 to 205, describes an event occurring in a segment of video and an object attribute related to the event through a natural language, so as to achieve the purpose of describing and summarizing video content.
Example 3
Two videos are selected as the videos to be described; as shown in FIG. 5, they are predicted with the deep-learning-and-text-summarization method of the invention to output the corresponding video descriptions:
(1) Using ImageNet as the training set, each picture in the data set is sampled to 256 × 256, and IMAGE = {Image_1, …, Image_{N_m}} is taken as input, wherein N_m is the number of pictures.
(2) The first convolutional layer is built: the convolution kernel cov1 has size 11 and stride 4; ReLU = max(0, x) is selected; a pooling operation with kernel size 3 and stride 2 is performed on the convolved feature map; and the pooled data are normalized with local response normalization, where, as in AlexNet, k = 2, n = 5, α = 10^(-4), β = 0.75.
(3) The second convolutional layer is built: the convolution kernel cov2 has size 5 and stride 1; ReLU = max(0, x) is selected; a pooling operation with kernel size 3 and stride 2 is performed on the convolved feature map; and the pooled data are normalized with local response normalization.
(4) The third convolutional layer is built: the convolution kernel cov3 has size 3 and stride 1; ReLU = max(0, x) is selected.
(5) The fourth convolutional layer is built: the convolution kernel cov4 has size 3 and stride 1; ReLU = max(0, x) is selected.
(6) The fifth convolutional layer is built: the convolution kernel cov5 has size 3 and stride 1; ReLU = max(0, x) is selected; a pooling operation with kernel size 3 and stride 2 is performed on the convolved feature map.
(7) The sixth layer, a fully connected layer, is built and set as fc6; ReLU = max(0, x) is selected, and dropout is applied to the processed data.
(8) The seventh layer, a fully connected layer, is built and set as fc7; ReLU = max(0, x) is selected, and dropout is applied to the processed data.
(9) The eighth layer, a fully connected layer, is built and set as fc8, and a Softmax classifier is added as the objective function.
(10) The Convolutional Neural Network (CNN) model is established from the eight network layers set above.
(11) The CNN model parameters are trained.
(12) Data processing: each video in the data set is uniformly sampled to 10 frames, each resized to 256 × 256, and the frames are input into the trained CNN model to obtain image features; each frame image is randomly paired with 5 of the video's text descriptions to form image-text pairs; a frame-sampling sketch follows.
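A sketch of the uniform 10-frame sampling in step (12), assuming OpenCV (cv2) is available for video decoding; the function name is illustrative.

```python
import cv2

def sample_frames(video_path, num_frames=10, size=(256, 256)):
    """Uniformly sample num_frames frames and resize them to 256 x 256."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for k in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, k * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return frames
```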
(13) A Recurrent Neural Network (RNN) model is constructed.
FIG. 5 shows the video text description results generated by the invention. The images in the figure are video frames extracted from the videos, and the sentence beside each frame image is the result obtained after the video features pass through the language model. The sentence at the lower part of the picture is the description obtained after the sentences generated from the video features are summarized, and serves as the text description of the video.
In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through the convolutional neural network and the recurrent neural network, and selects high-quality, representative sentences from them through the text summarization method. A user can obtain an accurate description of a video with this method, and the method can be extended to video retrieval.
Reference to the literature
[1] Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks [C]. Advances in Neural Information Processing Systems, 2012.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. A video description method based on deep learning and text summarization, characterized by comprising the following steps:
downloading videos from the Internet and describing each video to form <video, description> pairs, forming a text description training set;
training a convolutional neural network model on an image classification task with an existing image data set;
extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence by the method of graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video.
2. The video description method based on deep learning and text summarization according to claim 1, wherein the downloading of videos from the Internet and describing of each video to form <video, description> pairs, forming a text description training set, specifically comprises:
forming <video, description> pairs from the existing video set and the sentence descriptions corresponding to each video, so as to form the text description training set.
3. The method according to claim 1, wherein the steps of extracting a video frame sequence from each video, extracting the convolutional neural network features with the convolutional neural network model to form <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model specifically comprise:
extracting the convolutional neural network features of the images and the sentence descriptions corresponding to the images for modeling, using the parameters obtained from convolutional neural network model training, to obtain an objective function;
constructing a recurrent neural network, and modeling the nonlinear function through a long short-term memory network;
and optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.
4. The video description method based on deep learning and text summarization according to claim 1, wherein the step of describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence specifically comprises:
extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;
and taking the image features as input and using the model parameters obtained by training to generate the sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.
CN201510697454.3A 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text Active CN105279495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510697454.3A CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Publications (2)

Publication Number Publication Date
CN105279495A true CN105279495A (en) 2016-01-27
CN105279495B CN105279495B (en) 2019-06-04

Family

ID=55148479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510697454.3A Active CN105279495B (en) 2015-10-23 2015-10-23 A kind of video presentation method summarized based on deep learning and text

Country Status (1)

Country Link
CN (1) CN105279495B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8442927B2 (en) * 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
CN104113789A (en) * 2014-07-10 2014-10-22 杭州电子科技大学 On-line video abstraction generation method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gunes Erkan: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", Journal of Artificial Intelligence Research *
Subhashini Venugopalan et al.: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", Computer Science *

Cited By (106)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10437929B2 (en) 2016-03-31 2019-10-08 Maluuba Inc. Method and system for processing an input query using a forward and a backward neural network specific to unigrams
WO2017168252A1 (en) * 2016-03-31 2017-10-05 Maluuba Inc. Method and system for processing an input query
CN107292086A (en) * 2016-04-07 2017-10-24 西门子保健有限责任公司 Graphical analysis question and answer
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 A kind of image processing method and system
CN107391505B (en) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 Image processing method and system
CN106126492A (en) * 2016-06-07 2016-11-16 北京高地信息技术有限公司 Statement recognition methods based on two-way LSTM neutral net and device
CN106126492B (en) * 2016-06-07 2019-02-05 北京高地信息技术有限公司 Sentence recognition methods and device based on two-way LSTM neural network
CN106227793A (en) * 2016-07-20 2016-12-14 合网络技术(北京)有限公司 A kind of video and the determination method and device of Video Key word degree of association
CN106227793B (en) * 2016-07-20 2019-10-22 优酷网络技术(北京)有限公司 A kind of determination method and device of video and the Video Key word degree of correlation
CN107707931A (en) * 2016-08-08 2018-02-16 阿里巴巴集团控股有限公司 Generated according to video data and explain data, data synthesis method and device, electronic equipment
CN106372107B (en) * 2016-08-19 2020-01-17 中兴通讯股份有限公司 Method and device for generating natural language sentence library
CN106372107A (en) * 2016-08-19 2017-02-01 中兴通讯股份有限公司 Generation method and device of natural language sentence library
CN107784372A (en) * 2016-08-24 2018-03-09 阿里巴巴集团控股有限公司 Forecasting Methodology, the device and system of destination object attribute
CN107784372B (en) * 2016-08-24 2022-02-22 阿里巴巴集团控股有限公司 Target object attribute prediction method, device and system
CN106503055B (en) * 2016-09-27 2019-06-04 天津大学 A kind of generation method from structured text to iamge description
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description
CN106485251B (en) * 2016-10-08 2019-12-24 天津工业大学 Egg embryo classification based on deep learning
CN106485251A (en) * 2016-10-08 2017-03-08 天津工业大学 Egg embryo classification based on deep learning
CN109891897B (en) * 2016-10-27 2021-11-05 诺基亚技术有限公司 Method for analyzing media content
US11068722B2 (en) 2016-10-27 2021-07-20 Nokia Technologies Oy Method for analysing media content to generate reconstructed media content
CN109891897A (en) * 2016-10-27 2019-06-14 诺基亚技术有限公司 Method for analyzing media content
CN106650789B (en) * 2016-11-16 2023-04-07 同济大学 Image description generation method based on depth LSTM network
CN106650789A (en) * 2016-11-16 2017-05-10 同济大学 Image description generation method based on depth LSTM network
CN106782602A (en) * 2016-12-01 2017-05-31 南京邮电大学 Speech-emotion recognition method based on length time memory network and convolutional neural networks
CN106599198B (en) * 2016-12-14 2021-04-06 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method of multi-cascade junction cyclic neural network
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106650756B (en) * 2016-12-28 2019-12-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 knowledge migration-based image text description method of multi-mode recurrent neural network
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image text description method based on knowledge transfer multi-modal recurrent neural network
CN106845411B (en) * 2017-01-19 2020-06-30 清华大学 Video description generation method based on deep learning and probability map model
CN106845411A (en) * 2017-01-19 2017-06-13 清华大学 A kind of video presentation generation method based on deep learning and probability graph model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN106886768A (en) * 2017-03-02 2017-06-23 杭州当虹科技有限公司 A kind of video fingerprinting algorithms based on deep learning
WO2018170671A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Topic-guided model for image captioning system
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN108665055B (en) * 2017-03-28 2020-10-23 深圳荆虹科技有限公司 Method and device for generating graphic description
CN108665055A (en) * 2017-03-28 2018-10-16 上海荆虹电子科技有限公司 A kind of figure says generation method and device
US10983485B2 (en) 2017-04-04 2021-04-20 Siemens Aktiengesellschaft Method and control device for controlling a technical system
CN110678816B (en) * 2017-04-04 2021-02-19 西门子股份公司 Method and control device for controlling a technical system
CN110678816A (en) * 2017-04-04 2020-01-10 西门子股份公司 Method and control device for controlling a technical system
CN108734614A (en) * 2017-04-13 2018-11-02 腾讯科技(深圳)有限公司 Traffic congestion prediction technique and device, storage medium
CN110612537A (en) * 2017-05-02 2019-12-24 柯达阿拉里斯股份有限公司 System and method for batch normalized loop highway network
CN107203598A (en) * 2017-05-08 2017-09-26 广州智慧城市发展研究院 A kind of method and system for realizing image switch labels
US10445871B2 (en) 2017-05-22 2019-10-15 General Electric Company Image analysis neural network systems
CN108228686A (en) * 2017-06-15 2018-06-29 北京市商汤科技开发有限公司 It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN108228686B (en) * 2017-06-15 2021-03-23 北京市商汤科技开发有限公司 Method and device for realizing image-text matching and electronic equipment
CN107291882A (en) * 2017-06-19 2017-10-24 江苏软开信息科技有限公司 A kind of data automatic statistical analysis method
CN107515900A (en) * 2017-07-24 2017-12-26 宗晖(上海)机器人有限公司 Intelligent robot and its event memorandum system and method
CN107368887A (en) * 2017-07-25 2017-11-21 江西理工大学 A kind of structure and its construction method of profound memory convolutional neural networks
CN107368887B (en) * 2017-07-25 2020-08-07 江西理工大学 Deep memory convolutional neural network device and construction method thereof
US11481625B2 (en) 2017-08-04 2022-10-25 Nokia Technologies Oy Artificial neural network
WO2019024083A1 (en) * 2017-08-04 2019-02-07 Nokia Technologies Oy Artificial neural network
CN107578062A (en) * 2017-08-19 2018-01-12 四川大学 A kind of picture based on attribute probability vector guiding attention mode describes method
CN107609501A (en) * 2017-09-05 2018-01-19 东软集团股份有限公司 The close action identification method of human body and device, storage medium, electronic equipment
CN109522531A (en) * 2017-09-18 2019-03-26 腾讯科技(北京)有限公司 Official documents and correspondence generation method and device, storage medium and electronic device
CN109522531B (en) * 2017-09-18 2023-04-07 腾讯科技(北京)有限公司 Document generation method and device, storage medium and electronic device
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN107844751A (en) * 2017-10-19 2018-03-27 陕西师范大学 The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing
CN107844751B (en) * 2017-10-19 2021-08-27 陕西师范大学 Method for classifying hyperspectral remote sensing images of guide filtering long and short memory neural network
CN107818306B (en) * 2017-10-31 2020-08-07 天津大学 Video question-answering method based on attention model
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model
CN108200483A (en) * 2017-12-26 2018-06-22 中国科学院自动化研究所 Dynamically multi-modal video presentation generation method
CN108200483B (en) * 2017-12-26 2020-02-28 中国科学院自动化研究所 Dynamic multi-modal video description generation method
CN108491208A (en) * 2018-01-31 2018-09-04 中山大学 A kind of code annotation sorting technique based on neural network model
CN108307229A (en) * 2018-02-02 2018-07-20 新华智云科技有限公司 A kind of processing method and equipment of video-audio data
CN108307229B (en) * 2018-02-02 2023-12-22 新华智云科技有限公司 Video and audio data processing method and device
CN110119750A (en) * 2018-02-05 2019-08-13 浙江宇视科技有限公司 Data processing method, device and electronic equipment
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN108765383B (en) * 2018-03-22 2022-03-18 山西大学 Video description method based on deep migration learning
CN108830287A (en) * 2018-04-18 2018-11-16 哈尔滨理工大学 The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN108881950B (en) * 2018-05-30 2021-05-25 北京奇艺世纪科技有限公司 Video processing method and device
CN108881950A (en) * 2018-05-30 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN108683924A (en) * 2018-05-30 2018-10-19 北京奇艺世纪科技有限公司 A kind of method and apparatus of video processing
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN111325068B (en) * 2018-12-14 2023-11-07 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN111325068A (en) * 2018-12-14 2020-06-23 北京京东尚科信息技术有限公司 Video description method and device based on convolutional neural network
CN109711022A (en) * 2018-12-17 2019-05-03 哈尔滨工程大学 A kind of submarine anti-sinking system based on deep learning
CN109960747B (en) * 2019-04-02 2022-12-16 腾讯科技(深圳)有限公司 Video description information generation method, video processing method and corresponding devices
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
US11861886B2 (en) 2019-04-02 2024-01-02 Tencent Technology (Shenzhen) Company Limited Method and apparatus for generating video description information, and method and apparatus for video processing
WO2020220702A1 (en) * 2019-04-29 2020-11-05 北京三快在线科技有限公司 Generation of natural language
CN110210499B (en) * 2019-06-03 2023-10-13 中国矿业大学 Self-adaptive generation system for image semantic description
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description
CN110210499A (en) * 2019-06-03 2019-09-06 中国矿业大学 A kind of adaptive generation system of image, semantic description
WO2021056750A1 (en) * 2019-09-29 2021-04-01 北京市商汤科技开发有限公司 Search method and device, and storage medium
US11328512B2 (en) 2019-09-30 2022-05-10 Wipro Limited Method and system for generating a text summary for a multimedia content
CN110765921A (en) * 2019-10-18 2020-02-07 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110765921B (en) * 2019-10-18 2022-04-19 北京工业大学 Video object positioning method based on weak supervised learning and video spatiotemporal features
CN110781345B (en) * 2019-10-31 2022-12-27 北京达佳互联信息技术有限公司 Video description generation model obtaining method, video description generation method and device
CN110781345A (en) * 2019-10-31 2020-02-11 北京达佳互联信息技术有限公司 Video description generation model acquisition method, video description generation method and device
CN111461974A (en) * 2020-02-17 2020-07-28 天津大学 Image scanning path control method based on L STM model from coarse to fine
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on LSTM model from coarse to fine
CN111400545A (en) * 2020-03-01 2020-07-10 西北工业大学 Video annotation method based on deep learning
CN111404676B (en) * 2020-03-02 2023-08-29 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secret key and ciphertext
CN111404676A (en) * 2020-03-02 2020-07-10 北京丁牛科技有限公司 Method and device for generating, storing and transmitting secure and secret key and cipher text
CN111488807A (en) * 2020-03-29 2020-08-04 复旦大学 Video description generation system based on graph convolution network
CN111488807B (en) * 2020-03-29 2023-10-10 复旦大学 Video description generation system based on graph rolling network
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system and device for identifying and constructing audio frequency by video object and readable storage medium
CN111931690A (en) * 2020-08-28 2020-11-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN111931690B (en) * 2020-08-28 2024-08-13 Oppo广东移动通信有限公司 Model training method, device, equipment and storage medium
CN113191262A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description data processing method, device and storage medium
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video

Also Published As

Publication number Publication date
CN105279495B (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN105279495B (en) A kind of video presentation method summarized based on deep learning and text
CN106503055B (en) A kind of generation method from structured text to iamge description
CN108829822B (en) Media content recommendation method and device, storage medium and electronic device
CN109670039B (en) Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis
US20210256051A1 (en) Theme classification method based on multimodality, device, and storage medium
CN108073568B (en) Keyword extraction method and device
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN111783474B (en) Comment text viewpoint information processing method and device and storage medium
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN112270196B (en) Entity relationship identification method and device and electronic equipment
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN112819023B (en) Sample set acquisition method, device, computer equipment and storage medium
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN110750648A (en) Text emotion classification method based on deep learning and feature fusion
CN110263854A (en) Live streaming label determines method, apparatus and storage medium
CN110377778A (en) Figure sort method, device and electronic equipment based on title figure correlation
CN113239159A (en) Cross-modal retrieval method of videos and texts based on relational inference network
CN113343690A (en) Text readability automatic evaluation method and device
CN113051932A (en) Method for detecting category of network media event of semantic and knowledge extension topic model
CN110717090A (en) Network public praise evaluation method and system for scenic spots and electronic equipment
CN111507089A (en) Document classification method and device based on deep learning model and computer equipment
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115775349A (en) False news detection method and device based on multi-mode fusion

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220322

Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong

Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Patentee before: Tianjin University

TR01 Transfer of patent right