CN106934352A - A kind of video presentation method based on two-way fractal net work and LSTM - Google Patents
A kind of video presentation method based on two-way fractal net work and LSTM Download PDFInfo
- Publication number
- CN106934352A CN106934352A CN201710111507.8A CN201710111507A CN106934352A CN 106934352 A CN106934352 A CN 106934352A CN 201710111507 A CN201710111507 A CN 201710111507A CN 106934352 A CN106934352 A CN 106934352A
- Authority
- CN
- China
- Prior art keywords
- video
- network
- fractal
- lstm
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts optical flow features between adjacent frames of the original video; two fractal networks then learn high-level feature expressions of the video frames and of the optical flow features, respectively; these expressions are fed into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are weighted and averaged to obtain the description sentence corresponding to the video. The invention exploits both the original frames and the optical flow of the video to be described: the added optical flow features compensate for the dynamic information inevitably lost by frame sampling, so that changes of the video in both the spatial and temporal dimensions are taken into account. Furthermore, the novel fractal networks turn low-level features into abstract visual feature expressions, so that the people, objects, behaviors, spatial relations, and other connections involved in the video can be analyzed and mined more accurately.
Description
Technical Field
The invention belongs to the technical field of video description and deep learning, and particularly relates to a video description method based on a two-way fractal network and an LSTM.
Background
With the progress of science and technology and the development of society, camera-equipped terminals, especially smartphones, have become ubiquitous, and the price of hardware storage keeps falling, so the volume of multimedia information is growing exponentially. Faced with such massive video streams, how to analyze, recognize, and understand video efficiently and automatically, with minimal human intervention, and then describe it semantically has become a hot topic in current image processing and computer vision research. For most people, describing a short video in language after watching it is a simple matter. For a machine, however, extracting the pixel information of every frame of a video, analyzing and processing it, and generating a natural language description is a challenging task.
Enabling machines to describe video efficiently and automatically has broad application prospects in computer vision fields such as video retrieval, human-computer interaction, and traffic security, and thus further promotes research on the semantic description of video.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a video description method based on a two-way fractal network and an LSTM.
In order to achieve the purpose, the invention adopts the following technical scheme:
A video description method based on a two-way fractal network and LSTM is characterized in that key frames are first sampled from the video to be described and optical flow features between adjacent frames of the original video are extracted; high-level feature expressions of the key frames and of the optical flow features are then learned by two fractal networks, respectively; these expressions are respectively input into two recurrent neural network models based on LSTM units; and finally the output values of the two independent recurrent neural network models at each time step are weighted and averaged, so as to obtain the description sentence corresponding to the video. The method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks respectively;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural networks based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent models at each moment and obtaining the description sentences corresponding to the video.
Preferably, in step S1, the extracting optical flow features of the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing to a pixel range of [0,255 ];
and S1.2, calculating the amplitude value of the optical flow, and combining the optical flow characteristic values obtained in the last step to form an optical flow graph.
Preferably, the specific steps of obtaining the high-level feature expression of the key frame and the optical flow feature in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
Preferably, in steps S2.1 and S2.2, repeated application of a single expansion rule generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but no pass-through connections; meanwhile, in order to be able to extract high-performance fixed-depth sub-networks, a path-dropping method is adopted to regularize the co-adaptation of sub-paths in the fractal architecture; for fractal networks, simplicity of training matches simplicity of design, and a single loss function connected to the last layer suffices to drive internal behavior that mimics deep supervision; the fractal network adopted is a deep convolutional neural network based on a fractal structure.
Preferably, the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal specifically:
The base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types. The base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
Successive fractals are then defined recursively, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
In equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$; the join operation $\oplus$ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor of activations over a fixed number of channels in a spatial region, where the number of channels corresponds to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single connection layer, which merges all of its input feature blocks into a single output block.
Preferably, the rule for regularizing the co-adaptation of sub-paths in the fractal architecture by the path-dropping method in steps S2.1 and S2.2 is specifically as follows: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping inhibits the co-adaptation of parallel paths by randomly discarding operands of a connection layer, which effectively prevents the network from using one path as an anchor and another as a correction, a pattern that may cause over-fitting behavior; two sampling strategies are employed:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
Preferably, the step S3 of inputting the high-level feature vector into two LSTM unit-based recurrent neural network models is specifically:
the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
Preferably, the neural network model structure in step S3 is:
Based on the recurrent neural network structure of two layers of LSTM units, the stacked two-layer LSTM recurrent network encodes and decodes the input feature vectors, thereby realizing the conversion into natural language text. At each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task. Since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step. For a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
Preferably, in step S4, the specific operation of weighted averaging the output values of the two neural network independent models at each time point and obtaining the description sentence corresponding to the video is:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, the occurrence probability of each word in the vocabulary V is calculated with a softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary.
And S4.3, in the decoding stage of each moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The optical flow features added by the invention compensate for the dynamic information inevitably lost by frame sampling, so changes of the video in both the spatial and temporal dimensions are taken into account.
2. By processing any input video, the video description method based on the two-way fractal network and LSTM can automatically generate a descriptive sentence about the video content end to end, and can be applied to fields such as video retrieval, video surveillance, and human-computer interaction.
3. The invention turns low-level features into abstract visual feature expressions through a novel fractal network, so that the people, objects, behaviors, spatial relations, and other connections in the video can be analyzed and mined more accurately.
Drawings
Fig. 1 is a flow framework diagram of a two-way fractal network and LSTM based video description method provided by the present invention;
FIG. 2 is a schematic diagram of a fractal subnetwork used in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM unit based recurrent neural network employed by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The method comprises the steps of sampling key frames of a video to be described, extracting optical flow characteristics between two adjacent frames of an original video, learning and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics through two fractal networks, inputting the high-level characteristic expressions into two recurrent neural network models based on an LSTM unit, and performing weighted average on output values of the two independent recurrent neural network models at each moment to obtain a description sentence corresponding to the video.
FIG. 1 is an overall flow chart of the present invention, comprising the steps of:
(1) sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video; wherein, the specific operation of extracting the optical flow characteristic of the video to be described is as follows:
1. The optical flow values in the x direction and the y direction are calculated for every two adjacent frames of the video and normalized to the pixel range [0, 255];
2. The magnitude of the optical flow is calculated and combined with the optical flow values obtained in the previous step to form an optical flow graph.
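For illustration, a minimal sketch of this optical flow extraction step follows; the use of OpenCV's Farneback dense optical flow and all names here are assumptions for the example, since the embodiment does not prescribe a particular optical flow algorithm.

```python
# Sketch of step (1): x/y optical flow between adjacent frames, normalized to
# [0, 255], plus the flow magnitude, combined into a three-channel flow image.
# The Farneback algorithm and its parameters are illustrative assumptions.
import cv2
import numpy as np

def flow_image(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # flow[..., 0] and flow[..., 1] hold the x- and y-direction flow values
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)

    def to_u8(channel):
        # normalize an arbitrary-range channel into the [0, 255] pixel range
        return cv2.normalize(channel, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # combine x flow, y flow, and magnitude into a single optical flow graph
    return np.dstack([to_u8(flow[..., 0]), to_u8(flow[..., 1]), to_u8(mag)])
```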
(2) And respectively learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks. Sequentially inputting the sampling frames of the video obtained in the first step to a fractal network for processing the spatial dimension relation in the order of time points, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network; and sequentially inputting the obtained light flow diagrams to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
The fractal network is mainly characterized in that a self-similarity-based design strategy is introduced on a macroscopic framework of a neural network, an extremely deep network is generated through repeated application of a single expansion rule, and the structural layout of the fractal network is a truncated fractal. The network comprises interacting sub-paths of different lengths but does not comprise any through connections. Meanwhile, in order to realize the capability of extracting high-performance fixed-depth sub-networks, a path rejection method is adopted to regularize the cooperative adaptation of sub-paths in the fractal architecture. For fractal networks, the simplicity of training corresponds to the simplicity of the design, with a single loss function connected to the last layer sufficient to drive internal behavior to mimic deep supervision. The fractal network adopted in the invention is a deep convolutional neural network based on a fractal structure.
As shown in FIG. 2, which is a schematic diagram of the fractal structure, the base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types. The base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
The following fractal structure is then defined by recursion, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
In equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$. The join operation $\oplus$ combines two feature blocks into one, where a feature block is the output of a convolutional layer: a tensor of activations over a fixed number of channels in a spatial region. The number of channels corresponds to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent joins merge into a single connection layer. As shown on the right side of FIG. 2, this connection layer spans multiple columns and merges all of its input feature blocks into a single output block.
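The expansion rule above can be sketched as follows. This is a minimal PyTorch illustration, not the patented implementation; in particular it assumes that the join operation averages its two feature blocks and that the conv unit is a 3x3 convolution with batch normalization and ReLU.

```python
# Sketch of f_1(z) = conv(z) and f_{C+1}(z) = join((f_C o f_C)(z), conv(z)).
import torch
import torch.nn as nn

def conv_unit(channels):
    # the single selected layer type used as the basic building unit (assumed)
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.BatchNorm2d(channels),
                         nn.ReLU(inplace=True))

class Fractal(nn.Module):
    def __init__(self, channels, C):
        super().__init__()
        self.short = conv_unit(channels)                     # the new single-conv column
        self.deep = (nn.Sequential(Fractal(channels, C - 1),
                                   Fractal(channels, C - 1))
                     if C > 1 else None)                     # the (f_{C-1} o f_{C-1}) branch

    def forward(self, z):
        if self.deep is None:                                # base case: f_1(z) = conv(z)
            return self.short(z)
        # join: assumed here to be an element-wise mean of the two feature blocks
        return 0.5 * (self.short(z) + self.deep(z))

# example: a C = 3 block whose longest path has 2 ** (3 - 1) = 4 conv layers
# block = Fractal(channels=64, C=3)
```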
Since fractal networks contain additional large scale structures, it is proposed to use a coarse-grained regularization strategy like dropout and drop-connect. Path dropping inhibits the common adaptation of parallel paths by randomly dropping operands at the link layer, which effectively prevents the network from using one path as an anchor and another path as a correction that may cause overfitting behavior. Here, two sampling strategies are mainly used:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
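A minimal sketch of these two path-dropping (drop-path) strategies follows, applied at a join layer that averages whichever inputs survive; the drop probability and the framework (PyTorch) are assumptions for the example.

```python
import random
import torch

def local_drop_join(inputs, p_drop=0.15):
    """Local sampling: drop each input of the join with a fixed probability,
    but guarantee that at least one input is retained."""
    kept = [x for x in inputs if random.random() > p_drop]
    if not kept:
        kept = [random.choice(inputs)]
    return torch.stack(kept, dim=0).mean(dim=0)

def global_column_join(inputs, column_index):
    """Global sampling: a single column is chosen for the entire network
    (column_index is drawn once per sample), and every join keeps only the
    input arriving from that column."""
    return inputs[column_index]
```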
(3) And respectively inputting the high-level feature vectors obtained in the last step into two recurrent neural networks based on the LSTM unit. The recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
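For reference, a minimal NumPy sketch of one forward step following equations (1-3) to (1-7) is given below. The dictionary-based parameter layout is an assumption for the example, and the final hidden-state update h_t = o_t * tanh(c_t) is the standard LSTM completion rather than one of the equations listed above.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = (1 + e^(-x))^(-1)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step; W_x, W_h, b are dicts keyed by gate name 'i', 'f', 'o', 'g'."""
    i_t = sigmoid(W_x['i'] @ x_t + W_h['i'] @ h_prev + b['i'])   # (1-3) input gate
    f_t = sigmoid(W_x['f'] @ x_t + W_h['f'] @ h_prev + b['f'])   # (1-4) memory gate
    o_t = sigmoid(W_x['o'] @ x_t + W_h['o'] @ h_prev + b['o'])   # (1-5) output gate
    g_t = np.tanh(W_x['g'] @ x_t + W_h['g'] @ h_prev + b['g'])   # (1-6) core gate
    c_t = f_t * c_prev + i_t * g_t                               # (1-7) cell state
    h_t = o_t * np.tanh(c_t)                                     # assumed standard hidden update
    return h_t, c_t
```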
As shown in FIG. 3, the recurrent neural network structure based on two layers of LSTM units is used to encode and decode the input feature vectors, thereby realizing the conversion into natural language text. At each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task. Since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step. For a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
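As an illustration of this objective, the sketch below expresses maximizing the sentence log-likelihood as minimizing per-word cross-entropy with stochastic gradient descent; the PyTorch framework and the `model` interface (a hypothetical two-layer LSTM decoder returning per-step vocabulary logits) are assumptions, not part of the patent.

```python
import torch
import torch.nn.functional as F

def sentence_nll(model, features, word_ids):
    """Negative log-likelihood of a caption: -sum_t log p(y_t | h, y_{t-1}; theta)."""
    logits = model(features, word_ids[:-1])           # predict y_t given h and y_{t-1}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           word_ids[1:].reshape(-1),
                           reduction='sum')

# one training step (sketch):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # stochastic gradient descent
# loss = sentence_nll(model, features, word_ids)
# loss.backward()                                             # errors propagated back through time
# optimizer.step()
```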
(4) Carrying out weighted average on output values of two independent models at each moment and obtaining a description sentence corresponding to a video, wherein the specific operation is as follows:
1. carrying out weighted average on output values of the second layer of LSTM neurons at each moment of the two independent models;
2. The occurrence probability of each word in the vocabulary V is calculated using the softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary.
3. And in the decoding stage at each moment, the word with the maximum probability in the output value of the softmax function is taken, so that the corresponding video description sentence is obtained.
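A minimal sketch of this fusion-and-decoding step follows; the equal fusion weight alpha = 0.5 and all names are illustrative assumptions.

```python
import numpy as np

def decode_step(z_rgb, z_flow, W_vocab, alpha=0.5):
    """One decoding step: weighted average of the two streams' outputs,
    softmax over the vocabulary V, and greedy choice of the most probable word."""
    z_t = alpha * z_rgb + (1.0 - alpha) * z_flow   # weighted average of the two model outputs
    scores = W_vocab @ z_t                         # W_y z_t for every word y in the vocabulary
    probs = np.exp(scores - scores.max())          # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs            # index of the most probable word, and p(y | z_t)
```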
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (9)
1. A video description method based on two-way fractal network and LSTM is characterized in that firstly, sampling key frames of a video to be described is carried out, optical flow characteristics between two adjacent frames of an original video are extracted, then the two fractal networks are used for learning respectively and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics, then the high-level characteristic expressions are respectively input into two recurrent neural network models based on an LSTM unit, and finally, weighted average is carried out on output values of the two independent recurrent neural network models at each moment, so that description sentences corresponding to the video are obtained; the method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of key frames and optical flow features through two fractal networks respectively; wherein the fractal network is generated by repeated application of a single expansion rule;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural network models based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent recurrent neural network models at each moment and obtaining the description sentences corresponding to the video.
2. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S1 of extracting the optical flow features from the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing to a pixel range of [0,255 ];
and S1.2, calculating the amplitude value of the optical flow, and combining the optical flow characteristic values obtained in the last step to form an optical flow graph.
3. The video description method based on two-way fractal network and LSTM as claimed in claim 1, wherein the specific steps of obtaining the high-level feature expression of the key frame and optical flow features in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
4. A video description method based on two-way fractal network and LSTM according to claim 3, characterized in that the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but does not comprise any through connections; meanwhile, in order to realize the capability of extracting the high-performance fixed-depth sub-network, a path abandoning method is adopted to regularize the rule of cooperative adaptation of the sub-paths in the fractal architecture; for fractal networks, the simplicity of training corresponds to the simplicity of design, with a single loss function connected to the last layer sufficient to drive internal behavior to mimic deep supervision; the adopted fractal network is a deep convolutional neural network based on a fractal structure.
5. The method for describing videos based on two-way fractal network and LSTM according to claim 4, wherein the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal, specifically:
the base case $f_1(z)$ contains a single layer of a selected type between input and output; let $C$ denote the index of the truncated fractal $f_C$, where $f_C(\cdot)$ defines the network architecture, connections, and layer types; wherein the base case is a network containing a single convolutional layer, as in equation (1-1):
$f_1(z) = \mathrm{conv}(z)$ (1-1)
the following fractal is defined recursively, as in equation (1-2):
$f_{C+1}(z) = \left[(f_C \circ f_C)(z)\right] \oplus \left[\mathrm{conv}(z)\right]$ (1-2)
in equation (1-2), $\circ$ denotes composition and $\oplus$ denotes the join (connection) operation; $C$ corresponds to the number of columns, i.e. the width of the network $f_C(\cdot)$; depth is defined as the number of conv layers on the longest path from input to output and is proportional to $2^{C-1}$; convolutional networks for classification typically intersperse pooling (aggregation) layers; for the same purpose, $f_C$ is used as a building unit and stacked with a subsequent pooling layer $B$ times, giving a total depth of $B \cdot 2^{C-1}$; the join operation $\oplus$ merges two feature blocks into one; a feature block is the output of a conv layer: a tensor of activations over a fixed number of channels in a spatial region; the number of channels corresponds to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent joins are merged into a single connection layer; the connection layer merges all of its input feature blocks into a single output block.
6. The video description method according to claim 4, wherein the regularized adaptation rule of the sub-paths in the fractal architecture by using one of the path discarding methods in steps S2.1 and S2.2 is specifically: because the fractal network comprises an additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used, and paths abandon that the common adaptation of parallel paths is forbidden by randomly discarding operands of a connection layer, so that the mode effectively prevents the network from using one path as an anchor and the other path as a correction to possibly cause over-fitting behavior; two sampling strategies were employed:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, a single path is chosen for the entire network, and by restricting this path to a single column, each column is encouraged to be a strong predictor on its own.
7. The method for describing a video based on a two-way fractal network and an LSTM according to claim 1, wherein the step S3 of inputting the high-level feature vector to two recurrent neural network models based on LSTM units specifically includes: the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$ (1-3)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$ (1-4)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$ (1-5)
$g_t = \phi(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$ (1-6)
$c_t = f_t * c_{t-1} + i_t * g_t$ (1-7)
where $\sigma(x_t) = (1 + e^{-x_t})^{-1}$ is the sigmoid nonlinear activation function and $\phi(x_t) = \tanh(x_t)$ is the hyperbolic tangent nonlinear activation function; $i_t$, $f_t$, $o_t$, $c_t$ denote the state quantities of the input gate, memory gate, output gate, and core gate at time $t$; for each logic gate, $W_{xi}$, $W_{xf}$, $W_{xo}$, $W_{xg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the input $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hg}$ denote the weight transfer matrices of the input gate, memory gate, output gate, and core gate applied to the hidden-layer variable $h_{t-1}$ at time $t-1$; and $b_i$, $b_f$, $b_o$, $b_g$ denote the corresponding bias vectors of the input gate, memory gate, output gate, and core gate.
8. The video description method based on two-way fractal network and LSTM of claim 7, wherein the neural network model structure in step S3 is:
realizing the conversion into natural language text based on a recurrent neural network of two layers of LSTM units; at each time step, the first layer of LSTM neurons encodes the input visual feature vector, and the hidden-layer expression output at each time step serves as the input of the second layer of LSTM neurons; once the feature vectors of all video frames have been fed into the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task; since the network loses information in the decoding stage, the goal of model parameter training and learning is to maximize the log-likelihood of the whole predicted output sentence, given the hidden-layer expression and the output prediction at the previous time step; for a model with parameter $\theta$ and output sentence $Y = (y_1, y_2, \ldots, y_m)$, the parameter optimization objective can be expressed as:
$\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{m} \log p(y_t \mid h, y_{t-1}; \theta)$
where $\theta$ is the parameter, $Y$ represents the predicted output sentence, and $h$ is the hidden-layer expression; the objective function is optimized by the stochastic gradient descent method, and the error of the whole network is accumulated and transferred through the time dimension by the back-propagation algorithm.
9. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S4 is specifically operated to perform weighted average on the output values of the two neural network independent models at each moment and obtain the description sentence corresponding to the video:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, the occurrence probability of each word in the vocabulary V is calculated with a softmax function, expressed as:
$p(y \mid z_t) = \dfrac{\exp(W_y z_t)}{\sum_{y' \in V} \exp(W_{y'} z_t)}$
where $y$ denotes the predicted word, $z_t$ denotes the output value of the recurrent neural network at time $t$, and $W_y$ denotes the weight vector of word $y$ in the vocabulary;
and S4.3, in the decoding stage of each moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934352A true CN106934352A (en) | 2017-07-07 |
Family
ID=59424160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710111507.8A Pending CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934352A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Based on memory unit reinforcing-time-series dynamics study Activity recognition method |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
-
2017
- 2017-02-28 CN CN201710111507.8A patent/CN106934352A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
Non-Patent Citations (4)
Title |
---|
GUSTAV LARSSON ET AL.: "FractalNet:Ultra-Deep Neural Networks without Residuals", 《ARXIV:1605.07648V2》 * |
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets:Deep Networks for Videos Classification", 《IEEE》 * |
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", 《ARXIV:1406.2199V2》 * |
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", 《ARXIV :1604.01729V 1》 * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN110019952B (en) * | 2017-09-30 | 2023-04-18 | 华为技术有限公司 | Video description method, system and device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108235116B (en) * | 2017-12-27 | 2020-06-16 | 北京市商汤科技开发有限公司 | Feature propagation method and apparatus, electronic device, and medium |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108470212B (en) * | 2018-01-31 | 2020-02-21 | 江苏大学 | Efficient LSTM design method capable of utilizing event duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN108536735B (en) * | 2018-03-05 | 2020-12-15 | 中国科学院自动化研究所 | Multi-mode vocabulary representation method and system based on multi-channel self-encoder |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A kind of gesture identification method and system based on STT-LSTM network |
CN109522451B (en) * | 2018-12-13 | 2024-02-27 | 连尚(新昌)网络科技有限公司 | Repeated video detection method and device |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN109785336B (en) * | 2018-12-18 | 2020-11-27 | 深圳先进技术研究院 | Image segmentation method and device based on multipath convolutional neural network model |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Based on memory unit reinforcing-time-series dynamics study Activity recognition method |
CN109753897B (en) * | 2018-12-21 | 2022-05-27 | 西北工业大学 | Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning |
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method and device, storage medium and electronic equipment |
CN110197195B (en) * | 2019-04-15 | 2022-12-23 | 深圳大学 | Novel deep network system and method for behavior recognition |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934352A (en) | A kind of video presentation method based on two-way fractal net work and LSTM | |
CN111985245B (en) | Relationship extraction method and system based on attention cycle gating graph convolution network | |
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN109934261B (en) | Knowledge-driven parameter propagation model and few-sample learning method thereof | |
CN113487088A (en) | Traffic prediction method and device based on dynamic space-time diagram convolution attention model | |
Liu et al. | Time series prediction based on temporal convolutional network | |
CN116415654A (en) | Data processing method and related equipment | |
CN111914085A (en) | Text fine-grained emotion classification method, system, device and storage medium | |
CN113535953B (en) | Meta learning-based few-sample classification method | |
CN112115744B (en) | Point cloud data processing method and device, computer storage medium and electronic equipment | |
CN112529071B (en) | Text classification method, system, computer equipment and storage medium | |
CN109583659A (en) | User's operation behavior prediction method and system based on deep learning | |
CN114925205B (en) | GCN-GRU text classification method based on contrast learning | |
Khoshraftar et al. | Dynamic graph embedding via lstm history tracking | |
Feng et al. | A survey of visual neural networks: current trends, challenges and opportunities | |
Srinivas et al. | A comprehensive survey of techniques, applications, and challenges in deep learning: A revolution in machine learning | |
CN116663523B (en) | Semantic text similarity calculation method for multi-angle enhanced network | |
CN115761654B (en) | Vehicle re-identification method | |
Zhou et al. | What happens next? Combining enhanced multilevel script learning and dual fusion strategies for script event prediction | |
Zhu | A graph neural network-enhanced knowledge graph framework for intelligent analysis of policing cases | |
CN111259673A (en) | Feedback sequence multi-task learning-based law decision prediction method and system | |
CN116050523A (en) | Attention-directed enhanced common sense reasoning framework based on mixed knowledge graph | |
Nagrath et al. | A comprehensive E-commerce customer behavior analysis using convolutional methods | |
Mai et al. | From Efficient Multimodal Models to World Models: A Survey | |
CN112528015B (en) | Method and device for judging rumor in message interactive transmission |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170707 |
RJ01 | Rejection of invention patent application after publication |