CN105279495A - Video description method based on deep learning and text summarization
- Publication number: CN105279495A
- Application number: CN201510697454.3A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a video description method based on deep learning and text summarization. The method comprises the following steps: training a convolutional neural network model on an image classification task using an existing image data set; extracting the video frame sequence of a video, using the convolutional neural network model to extract convolutional neural network features, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training the recurrent neural network model; describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain description sequences; and ranking the description sequences with a text summarization method that uses graph-based lexical centrality as salience, and outputting the final description of the video. The events that occur in a video and the object attributes associated with those events are described in natural language, so that the video content is described and summarized.
Description
Technical Field
The invention relates to the field of video description, in particular to a video description method based on deep learning and text summarization.
Background
Describing a video in natural language is extremely important both for understanding the video and for retrieving it on the Web. The language description of video is also the subject of intensive research in multimedia and computer vision. Video description means that, for a given video, features are obtained by observing the content of the video and corresponding sentences are generated from that content. When people watch a video, especially a video of some action category, they gain a certain understanding of it and can put what happens in it into words, for example by describing the video with the sentence "a person is riding a motorcycle". However, when there are a large number of videos, describing them one by one manually requires a great deal of time, labor and money. It is therefore necessary to analyze video features with computer technology and combine them with natural language processing methods to generate descriptions of videos. On the one hand, video description lets people understand a video more accurately from a semantic perspective. On the other hand, in the field of video retrieval, retrieving the corresponding video from a text description entered by a user remains very difficult and challenging.
Various video description methods have emerged over the past few years. For example, by analyzing video features, the objects in a video and the action relationships among them can be identified. A fixed language template of the form subject + verb + object is then applied: the subject and the object are chosen from the recognized objects, the action relationship between them is used as the predicate, and a sentence describing the video is generated in this way.
However, such methods have limitations. Generating sentences from language templates tends to produce relatively fixed sentence patterns that are too uniform and lack the richness of natural human language. In addition, recognizing objects, actions and so on in a video requires different features, so the procedure is relatively complicated and a large amount of time is needed to train on the video features. Moreover, recognition accuracy directly affects the quality of the generated sentences; such a step-by-step method needs high correctness at every step and is difficult to implement.
Disclosure of Invention
The invention provides a video description method based on deep learning and text summarization, which describes in natural language the events that occur in a video and the object attributes related to those events, thereby describing and summarizing the video content. The details are as follows:
A video description method based on deep learning and text summarization, comprising the following steps:
downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;
training a convolutional neural network model on an image classification task using an existing image data set;
extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network model, and training the recurrent neural network model;
describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence with a text summarization method that uses graph-based lexical centrality as salience, and outputting the final description of the video.
The step of downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set, specifically comprises:
forming <video, description> pairs from the existing video set and the sentence descriptions corresponding to each video, which constitute the text description training set.
The step of extracting a video frame sequence from each video, extracting convolutional neural network features with the convolutional neural network model, forming <video frame sequence, text description sequence> pairs as the input of the recurrent neural network model, and training the recurrent neural network model specifically comprises:
using the parameters of the trained convolutional neural network model, extracting the convolutional neural network features of each image and modeling them together with the sentence description corresponding to the image to obtain an objective function;
constructing a recurrent neural network, and modeling its nonlinear function with a long short-term memory (LSTM) network;
and optimizing the objective function by gradient descent to obtain the trained LSTM network parameters.
The step of describing the video frame sequence of the video to be described with the trained recurrent neural network model to obtain the description sequence specifically comprises:
extracting the convolutional neural network features of each image with the trained model parameters and the convolutional neural network model to obtain image features;
and taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence descriptions corresponding to the video.
The technical scheme provided by the invention has the following beneficial effects. Each video is composed of a frame sequence, and a convolutional neural network is used to extract the low-level features of each frame. This effectively avoids the excessive noise that arises when conventional deep learning methods extract video features, noise that would otherwise reduce the accuracy of the sentences generated later. Each frame is converted into a sentence by the trained recurrent neural network, producing a set of sentences. An automatic text summarization method then computes the centrality among the sentences and selects high-quality, representative sentences from the set as the description of the video; this yields better video descriptions in terms of accuracy and sentence diversity. The method based on deep learning and text summarization can also be extended to video retrieval applications, although it is limited to English descriptions of video content.
Drawings
FIG. 1 is a flow chart of the video description method based on deep learning and text summarization;
FIG. 2 is a schematic diagram of the convolutional neural network (CNN) model used in the present invention;
where Cov denotes a convolution kernel; ReLU is the function max(0, x); pool denotes the pooling operation; LRN is the local response normalization operation; and softmax is the objective function.
FIG. 3 is a schematic diagram of the recurrent neural network used in the present invention;
where $x_t$ is the input at state $t$; $h_{t-1}$ is the hidden state of the previous state; $i$ is the input gate; $f$ is the forget gate; $o$ is the output gate; $c$ is the cell; and $m_t$ is the output after passing through one LSTM unit.
FIG. 4(a) is the initial fully connected graph of LexRank;
where $S = \{S_1, \dots, S_{10}\}$ denotes the 10 sentences generated by the recurrent neural network (RNN), represented as 10 nodes of a graph; the similarity between nodes is drawn as straight lines that form a fully connected graph, and the thickness of each line indicates the magnitude of the similarity.
FIG. 4(b) is the LexRank graph after pruning;
by setting a threshold, the lines between nodes with low similarity are removed; the lines that remain connect sentences whose similarity is high.
FIG. 5 is a schematic diagram of the sentences generated after some of the video frames are described;
where each frame is followed by the sentence generated for it by the CNN-RNN model used in the invention, and the sentence marked by the arrow is the summary selected by the LexRank method, which serves as the text description of the video.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below.
Given the problems described in the background, and given that deep learning methods have markedly improved image description, the inventors drew inspiration from image description and applied deep learning to video, which improves the diversity and correctness of the generated video descriptions to a certain extent.
The embodiment of the invention provides a video description method based on deep learning and text summarization. A convolutional neural network first extracts a feature for each frame of the video. Each video feature is then fed into a recurrent neural network framework, which generates a sentence description for each visual feature, i.e. for each frame of the video. This yields a set of sentences. To obtain the most expressive, high-quality sentences as the description of the video, the method applies text summarization: all sentences are ranked by computing the similarity between them, so that wrong or low-quality sentences are not chosen as the final description. Automatic text summarization yields a representative sentence with a certain correctness and reliability, which improves the accuracy of the video description. The method also overcomes some technical difficulties faced by video retrieval.
Example 1
A video description method based on deep learning and text summarization, referring to FIG. 1, comprises the following steps:
101: downloading videos from the Internet and describing each video (in English) to form <video, description> pairs, which constitute a text description training set, wherein each video corresponds to several sentence descriptions, which together form a text description sequence;
102: training a convolutional neural network (CNN) model on an image classification task using an existing image data set, for example ImageNet;
103: extracting a video frame sequence from each video, extracting CNN features with the CNN model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and training the RNN model;
104: describing the video frame sequence of a video to be described with the trained RNN model to obtain a description sequence;
105: ranking the reasonableness of the description sequence with LexRank, a text summarization method that uses graph-based lexical centrality as salience, and selecting the most reasonable description as the final description of the video.
In summary, through steps 101 to 105 the embodiment of the present invention describes in natural language the events that occur in a video and the object attributes related to those events, thereby describing and summarizing the video content.
Example 2
201: downloading videos from the Internet and describing each video to form <video, description> pairs, which constitute a text description training set;
The step specifically comprises:
(1) downloading from the Internet the Microsoft Research Video Description Corpus, which comprises 1970 video clips collected from YouTube. The data set may be represented as $VID = \{Video_1, \dots, Video_{N_d}\}$, where $N_d$ is the total number of videos in the set $VID$.
(2) Each video has several corresponding descriptions. The sentence descriptions of each video are $Sentences = \{Sentence_1, \dots, Sentence_N\}$, where $N$ is the number of sentence descriptions $(Sentence_1, \dots, Sentence_N)$ corresponding to the video.
(3) The <video, description> pairs are formed from the existing video set $VID$ and the sentence descriptions $Sentences$ corresponding to each video, and constitute the text description training set.
202: training a convolutional neural network (CNN) model on an image classification task using an existing image data set, thereby obtaining the CNN model parameters;
The step specifically comprises:
(1) constructing the AlexNet [1] CNN model shown in FIG. 2: the model comprises 8 network layers, of which the first 5 are convolutional layers and the last 3 are fully connected layers.
(2) Using ImageNet as the training set, each picture in the image data set is resampled to 256 × 256 and used as input, where $N_m$ is the number of pictures. According to the network layers set in FIG. 2, the first layer can be expressed as:
$$F_1(IMAGE) = \mathrm{norm}\{\mathrm{pool}[\max(0,\, W_1 * IMAGE + B_1)]\} \tag{1}$$
where $IMAGE$ denotes the input image; $W_1$ denotes the convolution kernel parameters; $B_1$ denotes the bias; $F_1(IMAGE)$ is the output of the first network layer; and norm denotes the normalization operation. In this layer, the convolved image is passed through the rectified linear function $\max(0, x)$ with $x = W_1 * IMAGE + B_1$, a pooling operation is then applied, and finally local response normalization (LRN) is performed as follows:
$$b^i_{x,y} = \frac{a^i_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(M-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}} \tag{2}$$
where $M$ is the number of feature maps after pooling; $i$ is the index of a feature map among the $M$ feature maps; $n$ is the size of the local normalization, i.e. normalization is performed over every $n$ feature maps; $a^i_{x,y}$ is the value at coordinate $(x, y)$ in the $i$-th feature map; $k$ is the offset; $\alpha$ and $\beta$ are the normalization parameters; and $b^i_{x,y}$ is the output after local response normalization (LRN).
In AlexNet, $k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$.
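The normalization step can be illustrated with a short NumPy sketch. It assumes the standard AlexNet form of local response normalization, in which each value is divided by $(k + \alpha \sum_j (a^j_{x,y})^2)^{\beta}$ with the sum running over the $n$ neighbouring feature maps; the function name `local_response_norm` and the `(M, H, W)` array layout are choices made for this illustration, not part of the patent.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each of the M feature maps in `a` (shape (M, H, W))
    by the squared activity of its n neighbouring maps."""
    M = a.shape[0]
    b = np.empty_like(a, dtype=np.float64)
    half = n // 2
    for i in range(M):
        lo = max(0, i - half)          # first neighbouring map
        hi = min(M - 1, i + half)      # last neighbouring map
        # Sum of squares over the neighbouring maps at every (x, y) position.
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

# Toy input: 8 feature maps of size 4 x 4, all ones.
feature_maps = np.ones((8, 4, 4))
normalized = local_response_norm(feature_maps)
```

Maps near the edge of the channel axis have fewer neighbours, so they are divided by a slightly smaller denominator than maps in the middle.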
Continuing with the model, $F_1(IMAGE)$ is used as the input of the second network layer, which can be expressed as:
$$F_2(IMAGE) = \max(0,\, W_2 * F_1(IMAGE) + B_2) \tag{3}$$
where $W_2$ denotes the convolution kernel parameters; $B_2$ denotes the bias; and $F_2(IMAGE)$ is the output of the second network layer. The second layer is set up in the same way as the first, except that the kernel sizes of the convolution layer and the pooling layer differ.
According to the network setup of AlexNet, the remaining convolutional layers can be expressed in turn as:
$$F_3(IMAGE) = \max(0,\, W_3 * F_2(IMAGE) + B_3) \tag{4}$$
$$F_4(IMAGE) = \max(0,\, W_4 * F_3(IMAGE) + B_4) \tag{5}$$
$$F_5(IMAGE) = \mathrm{pool}[\max(0,\, W_5 * F_4(IMAGE) + B_5)] \tag{6}$$
where $W_3, W_4, W_5$ and $B_3, B_4, B_5$ are the convolution parameters and biases of each layer.
The last 3 layers are fully connected layers. According to the network layers set in FIG. 2, they can be expressed in turn as:
$$F_6(IMAGE) = \mathrm{fc}[F_5(IMAGE),\, \theta_1] \tag{7}$$
$$F_7(IMAGE) = \mathrm{fc}[F_6(IMAGE),\, \theta_2] \tag{8}$$
$$F_8(IMAGE) = \mathrm{fc}[F_7(IMAGE),\, \theta_3] \tag{9}$$
where fc denotes a fully connected layer and $\theta_1, \theta_2, \theta_3$ denote the parameters of the three fully connected layers. The feature of the last layer, $F_8(IMAGE)$, is input to a 1000-class multi-class classifier for classification.
(3) A multi-class classifier is set on top of the current network; its formula can be expressed as:
where $l(\Theta)$ is the objective function; $m$ is the category of the image in ImageNet; $x^{(t)}$ is the CNN feature extracted for each category after passing through the AlexNet network; $y^{(t)}$ is the label corresponding to each image; and $\Theta = \{W_p, B_p, \theta_q\}$, $p = 1, \dots, 5$, $q = 1, 2, 3$, are the parameters of each network layer. The parameters of the objective function are optimized by gradient descent, which yields the parameters $\Theta$ of the AlexNet network.
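The classifier formula itself is not reproduced in this text. As a rough illustration only, the sketch below assumes the objective is the usual softmax negative log-likelihood used with AlexNet; the function names and the toy scores are hypothetical and are not taken from the patent.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax over class scores of shape (batch, classes)."""
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(scores, labels):
    """Mean negative log-probability assigned to the correct class."""
    probs = softmax(scores)
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# Two images scored over 1000 classes; all-zero scores give a uniform prediction.
scores = np.zeros((2, 1000))
labels = np.array([3, 7])
loss = softmax_cross_entropy(scores, labels)
```

With a uniform prediction over 1000 classes the loss equals $\log 1000$, which is a convenient sanity check before training starts; gradient descent then lowers this value by adjusting $\Theta$.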
203: extracting a video frame sequence from each video, extracting CNN features with the convolutional neural network (CNN) model, forming <video frame sequence, text description sequence> pairs as the input of a recurrent neural network (RNN) model, and training the RNN model;
The step specifically comprises:
(1) Following step 201, using the parameters of the trained CNN model, the CNN feature $I$ of each image is extracted and modeled together with the sentence description $S$ corresponding to the image. The objective function is:
$$\theta^* = \arg\max_{\theta} \sum_{(S, I)} \log p(S \mid I; \theta) \tag{11}$$
where $(S, I)$ denotes an image-text pair in the training data; $\theta$ is the model parameter to be optimized; and $\theta^*$ is the optimized parameter.
The aim of training is to maximize the sum of the log probabilities of the sentences generated for all samples, given the observed input image $I$. The conditional probability chain rule is used to compute the probability $p(S \mid I; \theta)$:
$$\log p(S \mid I; \theta) = \sum_{t} \log p(S_t \mid I, S_0, S_1, \dots, S_{t-1}; \theta) \tag{12}$$
where $S_0, S_1, \dots, S_{t-1}, S_t$ denote the words of the sentence. The unknown quantity $p(S_t \mid I, S_0, S_1, \dots, S_{t-1})$ in the formula is modeled with a recurrent neural network.
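The chain-rule decomposition can be sketched in a few lines of Python. The per-word probabilities below are made-up numbers standing in for the output of any conditional word model; `sentence_log_prob` is a hypothetical helper name.

```python
import math

def sentence_log_prob(step_probs):
    """Sum the log of each conditional word probability.

    step_probs[t] stands for p(S_t | I, S_0, ..., S_{t-1}).
    """
    return sum(math.log(p) for p in step_probs)

# Hypothetical conditional probabilities of the three words of one sentence.
step_probs = [0.5, 0.25, 0.8]
log_p = sentence_log_prob(step_probs)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow for long sentences, and the sum over all training pairs is the quantity maximized in the objective function.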
(2) Constructing a recurrent neural network (RNN):
the preceding $t-1$ words are taken as the condition and represented as a fixed-length hidden state $h_t$; when a new input $x_t$ arrives, the hidden state is updated through a nonlinear function $f$:
$$h_{t+1} = f(h_t, x_t) \tag{13}$$
where $h_{t+1}$ denotes the next hidden state.
(3) The nonlinear function $f$ is modeled by constructing a long short-term memory (LSTM) network, as shown in FIG. 3;
where $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate and $c$ is the cell. The update and output of each state can be expressed as:
$$i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1}) \tag{14}$$
$$f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1}) \tag{15}$$
$$o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1}) \tag{16}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \tag{17}$$
$$m_t = o_t \odot c_t \tag{18}$$
$$p_{t+1} = \mathrm{Softmax}(m_t) \tag{19}$$
where $\odot$ denotes the product with a gate value; the matrices $W = \{W_{ix}, W_{im}, W_{fx}, W_{fm}, W_{ox}, W_{om}, W_{cx}, W_{cm}\}$ are the parameters to be trained; $\sigma(\cdot)$ is the sigmoid function (e.g. $\sigma(W_{ix} x_t + W_{im} m_{t-1})$ and $\sigma(W_{fx} x_t + W_{fm} m_{t-1})$); $h(\cdot)$ is the hyperbolic tangent function (e.g. $h(W_{cx} x_t + W_{cm} m_{t-1})$); $p_{t+1}$ is the probability distribution of the next word after Softmax classification; and $m_t$ is the current state feature.
(4) The objective function (11) is optimized by gradient descent, which yields the trained LSTM network parameters $W$.
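A single LSTM update can be sketched in NumPy as follows. The sketch uses the gate updates given above and assumes the standard cell and output updates $c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1})$ and $m_t = o_t \odot c_t$. The dimensions, the random weights and the dictionary layout of `W` are illustrative assumptions; a practical captioning model would also project $m_t$ to the vocabulary size before the softmax.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM update: returns the new output m_t, cell c_t and p_{t+1}."""
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)   # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)   # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)   # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)
    m_t = o_t * c_t
    p_next = softmax(m_t)                             # distribution over the next word
    return m_t, c_t, p_next

# Hypothetical sizes: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
hidden, inputs = 3, 4
W = {
    name: rng.normal(scale=0.1, size=(hidden, inputs if name.endswith("x") else hidden))
    for name in ("ix", "im", "fx", "fm", "ox", "om", "cx", "cm")
}
m_t, c_t, p_next = lstm_step(rng.normal(size=inputs), np.zeros(hidden), np.zeros(hidden), W)
```

Calling `lstm_step` repeatedly, feeding each `m_t` and `c_t` back in together with the next input, unrolls the network over a sentence.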
204: describing the video frame sequence of a video to be described with the trained RNN model to obtain a description sequence, wherein the prediction comprises the following steps:
(1) Extract the test set, where $N_t$ is the number of test-set videos and $t$ denotes a test-set video. Ten frames are extracted from each video, which can be expressed as $Image_t = \{Image_t^1, \dots, Image_t^{10}\}$.
(2) Using the trained model parameters $\Theta = \{W_i, B_i, \theta_j\}$, $i = 1, \dots, 5$, $j = 1, 2, 3$, and the CNN model, extract the CNN feature of each image in $Image_t$ to obtain the image features $I_t = \{I_t^1, \dots, I_t^{10}\}$.
(3) Taking the image features $I_t$ as input and using the trained model parameters $W$ with equation (12), obtain the sentence descriptions $S = \{S_1, \dots, S_n\}$, which are the sentence descriptions corresponding to the video.
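The frame extraction in step (1) can be sketched as evenly spaced index selection. The text specifies only that 10 frames are taken per video; the even spacing over the whole clip and the helper name `uniform_frame_indices` are assumptions of this sketch.

```python
import numpy as np

def uniform_frame_indices(num_frames, num_samples=10):
    """Return num_samples frame indices spread evenly over [0, num_frames - 1]."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    positions = np.linspace(0, num_frames - 1, num=num_samples)
    return positions.round().astype(int).tolist()

# A hypothetical 300-frame clip.
indices = uniform_frame_indices(300)
```

For clips shorter than `num_samples` frames the same index is returned more than once, which keeps the feature sequence at a fixed length of 10.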
205: ranking the reasonableness of the description sequence with the LexRank method and selecting the most reasonable description as the final description of the video.
(1) The RNN model is applied to the video feature sequence $I_t = \{I_t^1, \dots, I_t^{10}\}$ to generate the corresponding sentence set $S = \{S_1, \dots, S_i, \dots, S_n\}$.
(2) Generate sentence features. Scan each sentence $S_i$ in the sentence set in turn, where $i = 1, \dots, N_s$, and collect every distinct word to form the vocabulary $VOL = \{w_1, \dots, w_{N_w}\}$, where $N_w$ is the total number of words in the vocabulary $VOL$. For each word $w_i$ in $VOL$, scan each sentence $S_j$ in the set $S$ in turn and count the number of occurrences $n_{ij}$ of $w_i$ in $S_j$, where $j = 1, \dots, N_s$ and $N_s$ is the total number of sentences; also count the number of sentences in $S$ that contain $w_i$, denoted $num(w_i)$. Compute the term frequency $tf(w_i, s_j)$ of each word $w_i$ in each sentence $S_j$ according to equation (20), where $i = 1, \dots, N_w$ and $j = 1, \dots, N_s$:
$$tf(w_i, s_j) = \frac{n_{ij}}{\sum_{k} n_{kj}} \tag{20}$$
where $n_{kj}$ is the number of occurrences of the $k$-th word in the $j$-th sentence.
For each word $w_i$ in the vocabulary $VOL$, compute the inverse document frequency $idf(w_i)$ according to equation (21):
$$idf(w_i) = \log\left(\frac{N_d}{num(w_i)}\right) \tag{21}$$
where $N_d$ is the total number of sentences in the set $S$, each sentence being treated as one document (so $N_d = N_s$).
According to the vector space model, each sentence $S_j$ in the set $S$ is represented as an $N_w$-dimensional vector whose $i$-th dimension corresponds to the word $w_i$ in the vocabulary and has the value $tfidf(w_i)$, computed as:
$$tfidf(w_i) = tf(w_i, s_j) \times idf(w_i) \tag{22}$$
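The sentence-vector construction can be sketched in plain Python. The sketch takes the term frequency as the word count divided by the sentence length and, following the usual convention, takes the numerator of the inverse document frequency to be the number of sentences; whitespace tokenization, lower-casing and the example sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return the sorted vocabulary and one tf-idf vector per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_sent = len(tokenized)
    # num(w): number of sentences that contain word w at least once.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_sent / doc_freq[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty sentences
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors

sentences = [
    "a man is riding a motorcycle",
    "a man rides a bike",
    "a dog is running",
]
vocab, vectors = tfidf_vectors(sentences)
```

A word that occurs in every sentence, such as "a" here, receives an inverse document frequency of zero and therefore contributes nothing to the similarity between sentences.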
(3) The cosine between two sentence vectors $S_i$ and $S_j$ is used as the sentence similarity, computed as:
$$sim(S_i, S_j) = \frac{\sum_{w \in S_i, S_j} tf_{w,S_i}\, tf_{w,S_j}\, (idf_w)^2}{\sqrt{\sum_{s_m \in S_i} \left(tf_{s_m,S_i}\, idf_{s_m}\right)^2} \times \sqrt{\sum_{s_n \in S_j} \left(tf_{s_n,S_j}\, idf_{s_n}\right)^2}} \tag{23}$$
where $tf_{w,S_i}$ is the term frequency of word $w$ in sentence $S_i$; $tf_{w,S_j}$ is the term frequency of word $w$ in sentence $S_j$; $idf_w$ is the inverse document frequency of word $w$; $s_m$ is any word in sentence $S_i$; $tf_{s_m,S_i}$ is the term frequency of word $s_m$ in $S_i$; $idf_{s_m}$ is the inverse document frequency of word $s_m$; $s_n$ is any word in sentence $S_j$; $tf_{s_n,S_j}$ is the term frequency of word $s_n$ in $S_j$; and $idf_{s_n}$ is the inverse document frequency of word $s_n$.
A fully connected undirected graph is then formed, as in FIG. 4(a), in which each node $u_i$ is a sentence $S_i$ and the edge between two nodes carries their sentence similarity.
(4) A threshold Degree is set, and all edges whose similarity is less than Degree are deleted, as shown in FIG. 4(b).
(5) Compute the LexRank score LR of each sentence node $u_i$. The initial score of each sentence node is $d/N$, where $N$ is the number of sentence nodes and $d$ is the damping factor, usually chosen in $[0.1, 0.2]$. The score LR is computed according to equation (24):
$$LR(u) = \frac{d}{N} + (1 - d) \sum_{v \in adj[u]} \frac{LR(v)}{\deg(v)} \tag{24}$$
where $adj[u]$ is the set of nodes adjacent to $u$; $\deg(v)$ is the degree of node $v$; $LR(u)$ is the score of node $u$; and $LR(v)$ is the score of node $v$.
(6) The LR scores of all sentence nodes are computed and sorted, and the sentence with the highest score is selected as the final description of the video.
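Steps (3) to (6) can be sketched with NumPy as below. The sketch assumes the usual LexRank recurrence $LR(u) = d/N + (1-d)\sum_{v \in adj[u]} LR(v)/\deg(v)$, solved by repeated substitution. The threshold value 0.1, the choice to keep a self-loop on every node so that no degree is zero, and the toy vectors are assumptions of this illustration rather than values given in the text.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors unchanged
    unit = v / norms
    return unit @ unit.T

def lexrank_scores(sim, threshold=0.1, d=0.15, iters=200):
    """Threshold the similarity graph and iterate the LexRank recurrence."""
    n = sim.shape[0]
    adj = (sim >= threshold).astype(float)  # drop edges below the threshold
    np.fill_diagonal(adj, 1.0)              # self-loops keep every degree >= 1
    deg = adj.sum(axis=1)
    scores = np.full(n, d / n)              # initial score d / N
    for _ in range(iters):
        # Each node v passes LR(v) / deg(v) to every node it is linked to.
        scores = d / n + (1 - d) * (adj.T @ (scores / deg))
    return scores

# Toy sentence vectors: sentence 0 shares words with all the others.
vectors = [[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
sim = cosine_similarity_matrix(vectors)
scores = lexrank_scores(sim)
best = int(np.argmax(scores))
```

In this toy graph sentence 0 is linked to every other sentence while the others are linked only to it, so it receives the highest score and would be chosen as the description.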
In summary, through steps 201 to 205 the embodiment of the present invention describes in natural language the events that occur in a video and the object attributes related to those events, thereby describing and summarizing the video content.
Example 3
Two videos are selected as the videos to be described, as shown in FIG. 5, and the method based on deep learning and text summarization of the present invention is used to predict and output the corresponding video descriptions:
(1) Using ImageNet as the training set, each picture in the data set is resampled to 256 × 256 and used as input, where $N_m$ is the number of pictures.
(2) Build the first convolutional layer: set the size of convolution kernel cov1 to 11 and the stride to 4, choose ReLU as max(0, x), apply a pooling operation to the convolved feature map with kernel size 3 and stride 2, and normalize the result with local response normalization. In AlexNet, $k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$.
(3) Build the second convolutional layer: set the size of convolution kernel cov2 to 5 and the stride to 1, choose ReLU as max(0, x), apply a pooling operation to the convolved feature map with kernel size 3 and stride 2, and normalize the result with local response normalization.
(4) Build the third convolutional layer: set the size of convolution kernel cov3 to 3 and the stride to 1, and choose ReLU as max(0, x).
(5) Build the fourth convolutional layer: set the size of convolution kernel cov4 to 3 and the stride to 1, and choose ReLU as max(0, x).
(6) Build the fifth convolutional layer: set the size of convolution kernel cov5 to 3 and the stride to 1, choose ReLU as max(0, x), and apply a pooling operation to the convolved feature map with kernel size 3 and stride 2.
(7) Build the sixth layer as a fully connected layer fc6, choose ReLU as max(0, x), and apply dropout to the processed data.
(8) Build the seventh layer as a fully connected layer fc7, choose ReLU as max(0, x), and apply dropout to the processed data.
(9) Build the eighth layer as a fully connected layer fc8, and add a Softmax classifier as the objective function.
(10) The eight network layers set up above constitute the convolutional neural network (CNN) model.
(11) Train the CNN model parameters.
(12) Data processing: 10 frames are extracted uniformly from each video in the data set and resampled to 256 × 256. The images are input to the trained CNN model to obtain image features, and each frame is randomly paired with 5 of the sentence descriptions of its video to form image-text pairs.
(13) Construct the recurrent neural network (RNN) model.
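The kernel sizes and strides listed in steps (2) to (6) determine the spatial size of the feature maps. The sketch below traces that size through the convolution and pooling layers. The padding values and the 227 × 227 input crop are taken from common AlexNet implementations and are assumptions here; the text above specifies only kernel sizes, strides and 256 × 256 resampling.

```python
def out_size(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad); the pad values are assumed, not given in the text.
layers = [
    ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1), ("pool5", 3, 2, 0),
]

def trace_sizes(input_size, layers):
    """Return (layer name, output size) for every layer in order."""
    sizes = []
    size = input_size
    for name, kernel, stride, pad in layers:
        size = out_size(size, kernel, stride, pad)
        sizes.append((name, size))
    return sizes

sizes = trace_sizes(227, layers)
```

Under these assumptions the feature map shrinks from 55 × 55 after the first convolution to 6 × 6 after the last pooling layer, which is the input to the fully connected layer fc6.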
Fig. 5 is a video text description result generated after the invention. The image in the figure is divided into video frames extracted from a video, and sentences corresponding to each frame of image are the results obtained after the video features pass through a language model. The lower part of the picture represents the sentence generated by the video feature and the image migration and the description of the video script only after being summarized.
In summary, the embodiment of the present invention converts the frame sequence of each video into a series of sentences through the convolutional neural network and the recurrent neural network, and selects high-quality, representative sentences from among them through a text summarization method. Users can apply the method to obtain an accurate description of a video, and the method can be extended to video retrieval.
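The sentence-selection step can be sketched as follows. The claims describe ranking by graph-based lexical centrality, which matches the cited LexRank paper; the code below is a simplified illustration rather than the patented implementation. It assumes whitespace tokenization, raw term counts instead of tf-idf weights, and continuous (unthresholded) similarity weights, and the names `cosine`, `lexrank_scores` and `rank_sentences` as well as the sample captions are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in their similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Edge weights are pairwise cosine similarities (self-similarity included).
    sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    row_sum = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # Power iteration of a damped random walk over the similarity graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] / row_sum[j] * scores[j]
                for j in range(n)
                if row_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def rank_sentences(sentences):
    """Return the sentences ordered from most to least central."""
    scores = lexrank_scores(sentences)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order]


# Hypothetical per-frame captions for one video.
captions = [
    "a man is playing a guitar",
    "a man plays the guitar",
    "a man is playing a guitar on stage",
    "dogs run through the park",
]
ranked = rank_sentences(captions)
```

Sentences that share vocabulary with many other sentences accumulate score, so a caption that disagrees with the rest of the frame-level captions tends to fall to the bottom of `ranked`, and the top entries can serve as the summary description.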
References
[1] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks [J]. Advances in Neural Information Processing Systems, 2012.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the numbering of the above embodiments of the present invention is for description only and does not indicate their relative merits.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. A video description method based on deep learning and text summarization, characterized by comprising the following steps:
downloading videos from the Internet and describing each video to form < video, description > pairs, which constitute a text description training set;
training a convolutional neural network model on an image classification task using an existing image data set;
extracting a video frame sequence from the video, extracting convolutional neural network features with the convolutional neural network model, forming < video frame sequence, text description sequence > pairs as the input of a recurrent neural network model, and training to obtain the recurrent neural network model;
describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence;
and ranking the description sequence by a method that uses graph-based lexical centrality as salience in text summarization, and outputting the final description result of the video.
2. The video description method based on deep learning and text summarization according to claim 1, wherein the step of downloading videos from the Internet and describing each video to form < video, description > pairs constituting a text description training set specifically comprises:
forming < video, description > pairs from an existing video set and the sentence descriptions corresponding to each video, which constitute the text description training set.
3. The method according to claim 1, wherein the step of extracting a video frame sequence from the video, extracting convolutional neural network features with the convolutional neural network model, forming < video frame sequence, text description sequence > pairs as the input of the recurrent neural network model, and training to obtain the recurrent neural network model specifically comprises:
using the parameters of the trained convolutional neural network model, extracting the convolutional neural network features of each image and modeling them together with the sentence description corresponding to the image to obtain an objective function;
constructing a recurrent neural network, and modeling the nonlinear function with a long short-term memory (LSTM) network;
and optimizing the objective function by gradient descent to obtain the trained long short-term memory network parameters.
4. The video description method based on deep learning and text summarization according to claim 1, wherein the step of describing the video frame sequence of a video to be described with the trained recurrent neural network model to obtain a description sequence specifically comprises:
extracting the convolutional neural network features of each image using the trained model parameters and the convolutional neural network model to obtain image features;
and taking the image features as input and using the trained model parameters to obtain sentence descriptions, thereby obtaining the sentence description corresponding to the video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510697454.3A CN105279495B (en) | 2015-10-23 | 2015-10-23 | Video description method based on deep learning and text summarization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510697454.3A CN105279495B (en) | 2015-10-23 | 2015-10-23 | Video description method based on deep learning and text summarization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105279495A (en) | 2016-01-27 |
CN105279495B (en) | 2019-06-04 |
Family
ID=55148479
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510697454.3A Active CN105279495B (en) | Video description method based on deep learning and text summarization | 2015-10-23 | 2015-10-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105279495B (en) |
Cited By (65)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105894043A (en) * | 2016-04-27 | 2016-08-24 | 上海高智科技发展有限公司 | Method and system for generating video description sentences |
CN106126492A (en) * | 2016-06-07 | 2016-11-16 | 北京高地信息技术有限公司 | Statement recognition methods based on two-way LSTM neutral net and device |
CN106227793A (en) * | 2016-07-20 | 2016-12-14 | 合网络技术(北京)有限公司 | A kind of video and the determination method and device of Video Key word degree of association |
CN106372107A (en) * | 2016-08-19 | 2017-02-01 | 中兴通讯股份有限公司 | Generation method and device of natural language sentence library |
CN106485251A (en) * | 2016-10-08 | 2017-03-08 | 天津工业大学 | Egg embryo classification based on deep learning |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106599198A (en) * | 2016-12-14 | 2017-04-26 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image description method for multi-stage connection recurrent neural network |
CN106650756A (en) * | 2016-12-28 | 2017-05-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image text description method based on knowledge transfer multi-modal recurrent neural network |
CN106650789A (en) * | 2016-11-16 | 2017-05-10 | 同济大学 | Image description generation method based on depth LSTM network |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech-emotion recognition method based on length time memory network and convolutional neural networks |
CN106845411A (en) * | 2017-01-19 | 2017-06-13 | 清华大学 | A kind of video presentation generation method based on deep learning and probability graph model |
CN106886768A (en) * | 2017-03-02 | 2017-06-23 | 杭州当虹科技有限公司 | A kind of video fingerprinting algorithms based on deep learning |
CN106934352A (en) * | 2017-02-28 | 2017-07-07 | 华南理工大学 | A kind of video presentation method based on two-way fractal net work and LSTM |
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information |
CN107203598A (en) * | 2017-05-08 | 2017-09-26 | 广州智慧城市发展研究院 | A kind of method and system for realizing image switch labels |
WO2017168252A1 (en) * | 2016-03-31 | 2017-10-05 | Maluuba Inc. | Method and system for processing an input query |
CN107292086A (en) * | 2016-04-07 | 2017-10-24 | 西门子保健有限责任公司 | Graphical analysis question and answer |
CN107291882A (en) * | 2017-06-19 | 2017-10-24 | 江苏软开信息科技有限公司 | A kind of data automatic statistical analysis method |
CN107368887A (en) * | 2017-07-25 | 2017-11-21 | 江西理工大学 | A kind of structure and its construction method of profound memory convolutional neural networks |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107515900A (en) * | 2017-07-24 | 2017-12-26 | 宗晖(上海)机器人有限公司 | Intelligent robot and its event memorandum system and method |
CN107578062A (en) * | 2017-08-19 | 2018-01-12 | 四川大学 | A kind of picture based on attribute probability vector guiding attention mode describes method |
CN107609501A (en) * | 2017-09-05 | 2018-01-19 | 东软集团股份有限公司 | The close action identification method of human body and device, storage medium, electronic equipment |
CN107707931A (en) * | 2016-08-08 | 2018-02-16 | 阿里巴巴集团控股有限公司 | Generated according to video data and explain data, data synthesis method and device, electronic equipment |
CN107784372A (en) * | 2016-08-24 | 2018-03-09 | 阿里巴巴集团控股有限公司 | Forecasting Methodology, the device and system of destination object attribute |
CN107818306A (en) * | 2017-10-31 | 2018-03-20 | 天津大学 | A kind of video answering method based on attention model |
CN107844751A (en) * | 2017-10-19 | 2018-03-27 | 陕西师范大学 | The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing |
CN108200483A (en) * | 2017-12-26 | 2018-06-22 | 中国科学院自动化研究所 | Dynamically multi-modal video presentation generation method |
CN108228686A (en) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | It is used to implement the matched method, apparatus of picture and text and electronic equipment |
CN108307229A (en) * | 2018-02-02 | 2018-07-20 | 新华智云科技有限公司 | A kind of processing method and equipment of video-audio data |
CN108491208A (en) * | 2018-01-31 | 2018-09-04 | 中山大学 | A kind of code annotation sorting technique based on neural network model |
WO2018170671A1 (en) * | 2017-03-20 | 2018-09-27 | Intel Corporation | Topic-guided model for image captioning system |
CN108665055A (en) * | 2017-03-28 | 2018-10-16 | 上海荆虹电子科技有限公司 | A kind of figure says generation method and device |
CN108683924A (en) * | 2018-05-30 | 2018-10-19 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus of video processing |
CN108734614A (en) * | 2017-04-13 | 2018-11-02 | 腾讯科技(深圳)有限公司 | Traffic congestion prediction technique and device, storage medium |
CN108765383A (en) * | 2018-03-22 | 2018-11-06 | 山西大学 | Video presentation method based on depth migration study |
CN108830287A (en) * | 2018-04-18 | 2018-11-16 | 哈尔滨理工大学 | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method |
CN108881950A (en) * | 2018-05-30 | 2018-11-23 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus of video processing |
WO2019024083A1 (en) * | 2017-08-04 | 2019-02-07 | Nokia Technologies Oy | Artificial neural network |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109522531A (en) * | 2017-09-18 | 2019-03-26 | 腾讯科技(北京)有限公司 | Official documents and correspondence generation method and device, storage medium and electronic device |
CN109711022A (en) * | 2018-12-17 | 2019-05-03 | 哈尔滨工程大学 | A kind of submarine anti-sinking system based on deep learning |
CN109891897A (en) * | 2016-10-27 | 2019-06-14 | 诺基亚技术有限公司 | Method for analyzing media content |
CN109960747A (en) * | 2019-04-02 | 2019-07-02 | 腾讯科技(深圳)有限公司 | The generation method of video presentation information, method for processing video frequency, corresponding device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN110119750A (en) * | 2018-02-05 | 2019-08-13 | 浙江宇视科技有限公司 | Data processing method, device and electronic equipment |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
US10445871B2 (en) | 2017-05-22 | 2019-10-15 | General Electric Company | Image analysis neural network systems |
CN110612537A (en) * | 2017-05-02 | 2019-12-24 | 柯达阿拉里斯股份有限公司 | System and method for batch normalized loop highway network |
CN110678816A (en) * | 2017-04-04 | 2020-01-10 | 西门子股份公司 | Method and control device for controlling a technical system |
CN110765921A (en) * | 2019-10-18 | 2020-02-07 | 北京工业大学 | Video object positioning method based on weak supervised learning and video spatiotemporal features |
CN110781345A (en) * | 2019-10-31 | 2020-02-11 | 北京达佳互联信息技术有限公司 | Video description generation model acquisition method, video description generation method and device |
CN111325068A (en) * | 2018-12-14 | 2020-06-23 | 北京京东尚科信息技术有限公司 | Video description method and device based on convolutional neural network |
CN111400545A (en) * | 2020-03-01 | 2020-07-10 | 西北工业大学 | Video annotation method based on deep learning |
CN111404676A (en) * | 2020-03-02 | 2020-07-10 | 北京丁牛科技有限公司 | Method and device for generating, storing and transmitting secure and secret key and cipher text |
CN111461974A (en) * | 2020-02-17 | 2020-07-28 | 天津大学 | Image scanning path control method based on LSTM model from coarse to fine |
CN111488807A (en) * | 2020-03-29 | 2020-08-04 | 复旦大学 | Video description generation system based on graph convolution network |
CN111681676A (en) * | 2020-06-09 | 2020-09-18 | 杭州星合尚世影视传媒有限公司 | Method, system and device for identifying and constructing audio frequency by video object and readable storage medium |
WO2020220702A1 (en) * | 2019-04-29 | 2020-11-05 | 北京三快在线科技有限公司 | Generation of natural language |
CN111931690A (en) * | 2020-08-28 | 2020-11-13 | Oppo广东移动通信有限公司 | Model training method, device, equipment and storage medium |
WO2021056750A1 (en) * | 2019-09-29 | 2021-04-01 | 北京市商汤科技开发有限公司 | Search method and device, and storage medium |
CN113191262A (en) * | 2021-04-29 | 2021-07-30 | 桂林电子科技大学 | Video description data processing method, device and storage medium |
CN113641854A (en) * | 2021-07-28 | 2021-11-12 | 上海影谱科技有限公司 | Method and system for converting characters into video |
US11328512B2 (en) | 2019-09-30 | 2022-05-10 | Wipro Limited | Method and system for generating a text summary for a multimedia content |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8442927B2 (en) * | 2009-07-30 | 2013-05-14 | Nec Laboratories America, Inc. | Dynamically configurable, multi-ported co-processor for convolutional neural networks |
CN104113789A (en) * | 2014-07-10 | 2014-10-22 | 杭州电子科技大学 | On-line video abstraction generation method based on depth learning |
- 2015-10-23: CN application CN201510697454.3A filed; patent CN105279495B, status Active
Non-Patent Citations (2)
Title |
---|
GUNES ERKAN: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", 《JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH》 * |
SUBHASHINI VENUGOPALAN et al.: "Translating Videos to Natural Language Using Deep Recurrent Neural Networks", 《COMPUTER SCIENCE》 * |
Cited By (106)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10437929B2 (en) | 2016-03-31 | 2019-10-08 | Maluuba Inc. | Method and system for processing an input query using a forward and a backward neural network specific to unigrams |
WO2017168252A1 (en) * | 2016-03-31 | 2017-10-05 | Maluuba Inc. | Method and system for processing an input query |
CN107292086A (en) * | 2016-04-07 | 2017-10-24 | 西门子保健有限责任公司 | Graphical analysis question and answer |
CN105894043A (en) * | 2016-04-27 | 2016-08-24 | 上海高智科技发展有限公司 | Method and system for generating video description sentences |
CN107391505A (en) * | 2016-05-16 | 2017-11-24 | 腾讯科技(深圳)有限公司 | A kind of image processing method and system |
CN107391505B (en) * | 2016-05-16 | 2020-10-23 | 腾讯科技(深圳)有限公司 | Image processing method and system |
CN106126492A (en) * | 2016-06-07 | 2016-11-16 | 北京高地信息技术有限公司 | Statement recognition methods based on two-way LSTM neutral net and device |
CN106126492B (en) * | 2016-06-07 | 2019-02-05 | 北京高地信息技术有限公司 | Sentence recognition methods and device based on two-way LSTM neural network |
CN106227793A (en) * | 2016-07-20 | 2016-12-14 | 合网络技术(北京)有限公司 | A kind of video and the determination method and device of Video Key word degree of association |
CN106227793B (en) * | 2016-07-20 | 2019-10-22 | 优酷网络技术(北京)有限公司 | A kind of determination method and device of video and the Video Key word degree of correlation |
CN107707931A (en) * | 2016-08-08 | 2018-02-16 | 阿里巴巴集团控股有限公司 | Generated according to video data and explain data, data synthesis method and device, electronic equipment |
CN106372107B (en) * | 2016-08-19 | 2020-01-17 | 中兴通讯股份有限公司 | Method and device for generating natural language sentence library |
CN106372107A (en) * | 2016-08-19 | 2017-02-01 | 中兴通讯股份有限公司 | Generation method and device of natural language sentence library |
CN107784372A (en) * | 2016-08-24 | 2018-03-09 | 阿里巴巴集团控股有限公司 | Forecasting Methodology, the device and system of destination object attribute |
CN107784372B (en) * | 2016-08-24 | 2022-02-22 | 阿里巴巴集团控股有限公司 | Target object attribute prediction method, device and system |
CN106503055B (en) * | 2016-09-27 | 2019-06-04 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
CN106485251B (en) * | 2016-10-08 | 2019-12-24 | 天津工业大学 | Egg embryo classification based on deep learning |
CN106485251A (en) * | 2016-10-08 | 2017-03-08 | 天津工业大学 | Egg embryo classification based on deep learning |
CN109891897B (en) * | 2016-10-27 | 2021-11-05 | 诺基亚技术有限公司 | Method for analyzing media content |
US11068722B2 (en) | 2016-10-27 | 2021-07-20 | Nokia Technologies Oy | Method for analysing media content to generate reconstructed media content |
CN109891897A (en) * | 2016-10-27 | 2019-06-14 | 诺基亚技术有限公司 | Method for analyzing media content |
CN106650789B (en) * | 2016-11-16 | 2023-04-07 | 同济大学 | Image description generation method based on depth LSTM network |
CN106650789A (en) * | 2016-11-16 | 2017-05-10 | 同济大学 | Image description generation method based on depth LSTM network |
CN106782602A (en) * | 2016-12-01 | 2017-05-31 | 南京邮电大学 | Speech-emotion recognition method based on length time memory network and convolutional neural networks |
CN106599198B (en) * | 2016-12-14 | 2021-04-06 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image description method of multi-cascade junction cyclic neural network |
CN106599198A (en) * | 2016-12-14 | 2017-04-26 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image description method for multi-stage connection recurrent neural network |
CN106650756B (en) * | 2016-12-28 | 2019-12-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | knowledge migration-based image text description method of multi-mode recurrent neural network |
CN106650756A (en) * | 2016-12-28 | 2017-05-10 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Image text description method based on knowledge transfer multi-modal recurrent neural network |
CN106845411B (en) * | 2017-01-19 | 2020-06-30 | 清华大学 | Video description generation method based on deep learning and probability map model |
CN106845411A (en) * | 2017-01-19 | 2017-06-13 | 清华大学 | A kind of video presentation generation method based on deep learning and probability graph model |
CN106934352A (en) * | 2017-02-28 | 2017-07-07 | 华南理工大学 | A kind of video presentation method based on two-way fractal net work and LSTM |
CN106886768A (en) * | 2017-03-02 | 2017-06-23 | 杭州当虹科技有限公司 | A kind of video fingerprinting algorithms based on deep learning |
WO2018170671A1 (en) * | 2017-03-20 | 2018-09-27 | Intel Corporation | Topic-guided model for image captioning system |
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information |
CN108665055B (en) * | 2017-03-28 | 2020-10-23 | 深圳荆虹科技有限公司 | Method and device for generating graphic description |
CN108665055A (en) * | 2017-03-28 | 2018-10-16 | 上海荆虹电子科技有限公司 | A kind of figure says generation method and device |
US10983485B2 (en) | 2017-04-04 | 2021-04-20 | Siemens Aktiengesellschaft | Method and control device for controlling a technical system |
CN110678816B (en) * | 2017-04-04 | 2021-02-19 | 西门子股份公司 | Method and control device for controlling a technical system |
CN110678816A (en) * | 2017-04-04 | 2020-01-10 | 西门子股份公司 | Method and control device for controlling a technical system |
CN108734614A (en) * | 2017-04-13 | 2018-11-02 | 腾讯科技(深圳)有限公司 | Traffic congestion prediction technique and device, storage medium |
CN110612537A (en) * | 2017-05-02 | 2019-12-24 | 柯达阿拉里斯股份有限公司 | System and method for batch normalized loop highway network |
CN107203598A (en) * | 2017-05-08 | 2017-09-26 | 广州智慧城市发展研究院 | A kind of method and system for realizing image switch labels |
US10445871B2 (en) | 2017-05-22 | 2019-10-15 | General Electric Company | Image analysis neural network systems |
CN108228686A (en) * | 2017-06-15 | 2018-06-29 | 北京市商汤科技开发有限公司 | It is used to implement the matched method, apparatus of picture and text and electronic equipment |
CN108228686B (en) * | 2017-06-15 | 2021-03-23 | 北京市商汤科技开发有限公司 | Method and device for realizing image-text matching and electronic equipment |
CN107291882A (en) * | 2017-06-19 | 2017-10-24 | 江苏软开信息科技有限公司 | A kind of data automatic statistical analysis method |
CN107515900A (en) * | 2017-07-24 | 2017-12-26 | 宗晖(上海)机器人有限公司 | Intelligent robot and its event memorandum system and method |
CN107368887A (en) * | 2017-07-25 | 2017-11-21 | 江西理工大学 | A kind of structure and its construction method of profound memory convolutional neural networks |
CN107368887B (en) * | 2017-07-25 | 2020-08-07 | 江西理工大学 | Deep memory convolutional neural network device and construction method thereof |
US11481625B2 (en) | 2017-08-04 | 2022-10-25 | Nokia Technologies Oy | Artificial neural network |
WO2019024083A1 (en) * | 2017-08-04 | 2019-02-07 | Nokia Technologies Oy | Artificial neural network |
CN107578062A (en) * | 2017-08-19 | 2018-01-12 | 四川大学 | A kind of picture based on attribute probability vector guiding attention mode describes method |
CN107609501A (en) * | 2017-09-05 | 2018-01-19 | 东软集团股份有限公司 | The close action identification method of human body and device, storage medium, electronic equipment |
CN109522531A (en) * | 2017-09-18 | 2019-03-26 | 腾讯科技(北京)有限公司 | Official documents and correspondence generation method and device, storage medium and electronic device |
CN109522531B (en) * | 2017-09-18 | 2023-04-07 | 腾讯科技(北京)有限公司 | Document generation method and device, storage medium and electronic device |
CN110019952B (en) * | 2017-09-30 | 2023-04-18 | 华为技术有限公司 | Video description method, system and device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN107844751A (en) * | 2017-10-19 | 2018-03-27 | 陕西师范大学 | The sorting technique of guiding filtering length Memory Neural Networks high-spectrum remote sensing |
CN107844751B (en) * | 2017-10-19 | 2021-08-27 | 陕西师范大学 | Method for classifying hyperspectral remote sensing images of guide filtering long and short memory neural network |
CN107818306B (en) * | 2017-10-31 | 2020-08-07 | 天津大学 | Video question-answering method based on attention model |
CN107818306A (en) * | 2017-10-31 | 2018-03-20 | 天津大学 | A kind of video answering method based on attention model |
CN108200483A (en) * | 2017-12-26 | 2018-06-22 | 中国科学院自动化研究所 | Dynamically multi-modal video presentation generation method |
CN108200483B (en) * | 2017-12-26 | 2020-02-28 | 中国科学院自动化研究所 | Dynamic multi-modal video description generation method |
CN108491208A (en) * | 2018-01-31 | 2018-09-04 | 中山大学 | A kind of code annotation sorting technique based on neural network model |
CN108307229A (en) * | 2018-02-02 | 2018-07-20 | 新华智云科技有限公司 | A kind of processing method and equipment of video-audio data |
CN108307229B (en) * | 2018-02-02 | 2023-12-22 | 新华智云科技有限公司 | Video and audio data processing method and device |
CN110119750A (en) * | 2018-02-05 | 2019-08-13 | 浙江宇视科技有限公司 | Data processing method, device and electronic equipment |
CN108765383A (en) * | 2018-03-22 | 2018-11-06 | 山西大学 | Video presentation method based on depth migration study |
CN108765383B (en) * | 2018-03-22 | 2022-03-18 | 山西大学 | Video description method based on deep migration learning |
CN108830287A (en) * | 2018-04-18 | 2018-11-16 | 哈尔滨理工大学 | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method |
CN108881950B (en) * | 2018-05-30 | 2021-05-25 | 北京奇艺世纪科技有限公司 | Video processing method and device |
CN108881950A (en) * | 2018-05-30 | 2018-11-23 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus of video processing |
CN108683924A (en) * | 2018-05-30 | 2018-10-19 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus of video processing |
CN109522451B (en) * | 2018-12-13 | 2024-02-27 | 连尚(新昌)网络科技有限公司 | Repeated video detection method and device |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN111325068B (en) * | 2018-12-14 | 2023-11-07 | 北京京东尚科信息技术有限公司 | Video description method and device based on convolutional neural network |
CN111325068A (en) * | 2018-12-14 | 2020-06-23 | 北京京东尚科信息技术有限公司 | Video description method and device based on convolutional neural network |
CN109711022A (en) * | 2018-12-17 | 2019-05-03 | 哈尔滨工程大学 | A kind of submarine anti-sinking system based on deep learning |
CN109960747B (en) * | 2019-04-02 | 2022-12-16 | 腾讯科技(深圳)有限公司 | Video description information generation method, video processing method and corresponding devices |
CN109960747A (en) * | 2019-04-02 | 2019-07-02 | 腾讯科技(深圳)有限公司 | The generation method of video presentation information, method for processing video frequency, corresponding device |
US11861886B2 (en) | 2019-04-02 | 2024-01-02 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating video description information, and method and apparatus for video processing |
WO2020220702A1 (en) * | 2019-04-29 | 2020-11-05 | 北京三快在线科技有限公司 | Generation of natural language |
CN110210499B (en) * | 2019-06-03 | 2023-10-13 | 中国矿业大学 | Self-adaptive generation system for image semantic description |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
CN110210499A (en) * | 2019-06-03 | 2019-09-06 | 中国矿业大学 | A kind of adaptive generation system of image, semantic description |
WO2021056750A1 (en) * | 2019-09-29 | 2021-04-01 | 北京市商汤科技开发有限公司 | Search method and device, and storage medium |
US11328512B2 (en) | 2019-09-30 | 2022-05-10 | Wipro Limited | Method and system for generating a text summary for a multimedia content |
CN110765921A (en) * | 2019-10-18 | 2020-02-07 | 北京工业大学 | Video object positioning method based on weak supervised learning and video spatiotemporal features |
CN110765921B (en) * | 2019-10-18 | 2022-04-19 | 北京工业大学 | Video object positioning method based on weak supervised learning and video spatiotemporal features |
CN110781345B (en) * | 2019-10-31 | 2022-12-27 | 北京达佳互联信息技术有限公司 | Video description generation model obtaining method, video description generation method and device |
CN110781345A (en) * | 2019-10-31 | 2020-02-11 | 北京达佳互联信息技术有限公司 | Video description generation model acquisition method, video description generation method and device |
CN111461974A (en) * | 2020-02-17 | 2020-07-28 | 天津大学 | Image scanning path control method based on LSTM model from coarse to fine |
CN111461974B (en) * | 2020-02-17 | 2023-04-25 | 天津大学 | Image scanning path control method based on LSTM model from coarse to fine |
CN111400545A (en) * | 2020-03-01 | 2020-07-10 | 西北工业大学 | Video annotation method based on deep learning |
CN111404676B (en) * | 2020-03-02 | 2023-08-29 | 北京丁牛科技有限公司 | Method and device for generating, storing and transmitting secret key and ciphertext |
CN111404676A (en) * | 2020-03-02 | 2020-07-10 | 北京丁牛科技有限公司 | Method and device for generating, storing and transmitting secure and secret key and cipher text |
CN111488807A (en) * | 2020-03-29 | 2020-08-04 | 复旦大学 | Video description generation system based on graph convolution network |
CN111488807B (en) * | 2020-03-29 | 2023-10-10 | 复旦大学 | Video description generation system based on graph rolling network |
CN111681676B (en) * | 2020-06-09 | 2023-08-08 | 杭州星合尚世影视传媒有限公司 | Method, system, device and readable storage medium for constructing audio frequency by video object identification |
CN111681676A (en) * | 2020-06-09 | 2020-09-18 | 杭州星合尚世影视传媒有限公司 | Method, system and device for identifying and constructing audio frequency by video object and readable storage medium |
CN111931690A (en) * | 2020-08-28 | 2020-11-13 | Oppo广东移动通信有限公司 | Model training method, device, equipment and storage medium |
CN111931690B (en) * | 2020-08-28 | 2024-08-13 | Oppo广东移动通信有限公司 | Model training method, device, equipment and storage medium |
CN113191262A (en) * | 2021-04-29 | 2021-07-30 | 桂林电子科技大学 | Video description data processing method, device and storage medium |
CN113641854B (en) * | 2021-07-28 | 2023-09-26 | 上海影谱科技有限公司 | Method and system for converting text into video |
CN113641854A (en) * | 2021-07-28 | 2021-11-12 | 上海影谱科技有限公司 | Method and system for converting characters into video |
Also Published As
Publication number | Publication date |
---|---|
CN105279495B (en) | 2019-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105279495B (en) | Video description method based on deep learning and text summarization | |
CN106503055B (en) | A kind of generation method from structured text to iamge description | |
CN108829822B (en) | Media content recommendation method and device, storage medium and electronic device | |
CN109670039B (en) | Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis | |
US20210256051A1 (en) | Theme classification method based on multimodality, device, and storage medium | |
CN108073568B (en) | Keyword extraction method and device | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
CN111783474B (en) | Comment text viewpoint information processing method and device and storage medium | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN112270196B (en) | Entity relationship identification method and device and electronic equipment | |
CN111143576A (en) | Event-oriented dynamic knowledge graph construction method and device | |
CN110532379B (en) | Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis | |
CN112819023B (en) | Sample set acquisition method, device, computer equipment and storage medium | |
CN110851641B (en) | Cross-modal retrieval method and device and readable storage medium | |
CN112989802B (en) | Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium | |
CN110750648A (en) | Text emotion classification method based on deep learning and feature fusion | |
CN110263854A (en) | Live streaming label determines method, apparatus and storage medium | |
CN110377778A (en) | Figure sort method, device and electronic equipment based on title figure correlation | |
CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
CN113343690A (en) | Text readability automatic evaluation method and device | |
CN113051932A (en) | Method for detecting category of network media event of semantic and knowledge extension topic model | |
CN110717090A (en) | Network public praise evaluation method and system for scenic spots and electronic equipment | |
CN111507089A (en) | Document classification method and device based on deep learning model and computer equipment | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
CN115775349A (en) | False news detection method and device based on multi-mode fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220322. Address after: 511400 4th floor, No. 685, Shiqiao South Road, Panyu District, Guangzhou, Guangdong. Patentee after: GUANGZHOU WELLTHINKER AUTOMATION TECHNOLOGY CO.,LTD. Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92. Patentee before: Tianjin University |