CN106934352A - A kind of video presentation method based on two-way fractal net work and LSTM - Google Patents

A kind of video presentation method based on two-way fractal net work and LSTM

Info

Publication number
CN106934352A
CN106934352A CN201710111507.8A
Authority
CN
China
Prior art keywords
video
network
fractal
lstm
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710111507.8A
Other languages
Chinese (zh)
Inventor
李楚怡
袁东芝
余卫宇
胡丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710111507.8A priority Critical patent/CN106934352A/en
Publication of CN106934352A publication Critical patent/CN106934352A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts the optical flow features between every two adjacent frames of the original video; it then learns high-level feature expressions of the video frames and of the optical flow features through two separate fractal networks; these expressions are fed into two recurrent neural network models based on LSTM units, and finally the output values of the two independent models at each moment are averaged with weights, thereby obtaining the description sentence corresponding to the video. The invention makes use of both the original video frames and the optical flow of the video to be described: the added optical flow features compensate for the dynamic information inevitably lost by frame sampling, so that changes of the video in both the spatial dimension and the time dimension are taken into account. Furthermore, abstract visual feature expressions of the low-level features are obtained through a novel fractal network, so that the people, objects, behaviors and spatial position relations involved in the video can be analyzed and mined more accurately.

Description

Video description method based on two-way fractal network and LSTM
Technical Field
The invention belongs to the technical field of video description and deep learning, and particularly relates to a video description method based on a two-way fractal network and an LSTM.
Background
With the progress of science and technology and the development of society, camera terminals of all kinds, especially smart phones, have become ubiquitous, and the price of hardware storage keeps falling, so the volume of multimedia information grows exponentially. Faced with such massive video information streams, how to analyze, recognize and understand them efficiently and automatically, with human intervention minimized, and then describe them semantically, has become a popular topic in current image processing and computer vision research. Describing a brief video in language after watching it is a simple matter for most people. For a machine, however, extracting the pixel information of every frame of a video, analyzing and processing it, and generating a natural language description is a challenging task.
Enabling machines to describe videos efficiently and automatically has broad application prospects in computer vision fields such as video retrieval, human-computer interaction and traffic security, which further motivates research on the semantic description of video.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provide a video description method based on a two-way fractal network and an LSTM.
In order to achieve the purpose, the invention adopts the following technical scheme:
a video description method based on two-way fractal network and LSTM is characterized in that firstly, sampling key frames of a video to be described is carried out, optical flow characteristics between two adjacent frames of an original video are extracted, then, high-level characteristic expressions of the key frames and the optical flow characteristics are obtained through learning of the two fractal networks respectively, then, the high-level characteristic expressions are input into two recurrent neural network models based on an LSTM unit respectively, and finally, weighted average is carried out on output values of the two independent recurrent neural network models at each moment, so that description sentences corresponding to the video are obtained. The method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks respectively;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural networks based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent models at each moment and obtaining the description sentences corresponding to the video.
Preferably, in step S1, the extracting optical flow features of the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing them to the pixel range [0, 255];
and S1.2, calculating the amplitude value of the optical flow, and combining it with the optical flow characteristic values obtained in the last step to form an optical flow graph, as sketched below.
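To make step S1 concrete, the following minimal Python sketch computes the optical flow graph for one pair of adjacent frames. The Farneback dense optical flow estimator and the min-max normalization are assumptions for illustration; the method does not prescribe a particular optical flow algorithm.

```python
# Minimal sketch of S1.1/S1.2: dense optical flow between two adjacent frames,
# normalized to [0, 255] and combined with the flow magnitude into a
# 3-channel "optical flow graph". Farneback flow is an assumed estimator.
import cv2
import numpy as np

def flow_image(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # S1.1: optical flow values in the x and y directions
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # S1.2: flow magnitude, then normalize all three channels to [0, 255]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    norm = lambda a: cv2.normalize(a, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return np.dstack([norm(fx), norm(fy), norm(mag)])
```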
Preferably, the specific steps of obtaining the high-level feature expression of the key frame and the optical flow feature in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
Preferably, in steps S2.1 and S2.2, the repeated application of a single expansion rule generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but does not comprise any pass-through connections; meanwhile, in order to obtain the capability of extracting high-performance fixed-depth sub-networks, a path-dropping method is adopted to regularize the cooperative adaptation of the sub-paths in the fractal architecture; for fractal networks, the simplicity of training corresponds to the simplicity of design, and a single loss function connected to the last layer is sufficient to drive internal behavior that mimics deep supervision; the adopted fractal network is a deep convolutional neural network based on a fractal structure.
Preferably, the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal specifically:
The basic case f_1(z) comprises a single layer of the selected type between input and output; let C denote the index of the truncated fractal f_C, where f_C(·) defines the network architecture, connections, and layer types. The basic case is a network containing a single convolutional layer, as in equation (1-1):
f_1(z) = conv(z)    (1-1)
Successive fractals are then defined recursively, as in equation (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)    (1-2)
In equation (1-2), ∘ denotes composition and ⊕ denotes the connection (join) operation; C corresponds to the number of columns, i.e. the width, of network f_C(·); depth is defined as the number of conv layers on the longest path from input to output and is proportional to 2^{C-1}. Convolutional networks for classification typically intersperse aggregation (pooling) layers; for the same purpose, f_C is used as a building unit and stacked with a subsequent aggregation layer B times, giving a total depth of B·2^{C-1}. The connection operation ⊕ combines two feature blocks into one; a feature block is the result of a conv layer: a tensor of activations held over a fixed number of channels within a spatial region; the number of channels corresponds to the number of filters of the preceding conv layer. When the fractal is expanded, adjacent connections are merged into a single connection layer, and this connection layer merges all of its input feature blocks into a single output block.
Preferably, the rule for regularizing the cooperative adaptation of the sub-paths in the fractal architecture by the path-dropping method in steps S2.1 and S2.2 is specifically: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping prevents the co-adaptation of parallel paths by randomly discarding operands of the connection layer, which effectively stops the network from using one path as an anchor and another path as a correction, a behavior that may cause over-fitting; two sampling strategies are employed, as sketched after the following two items:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, each path is chosen for the entire network, and by restricting this path to being single-column, each column is motivated to be a powerful predictor.
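The following minimal sketch illustrates the two sampling strategies at a single connection layer; the drop probability, the averaging join, the per-batch mixing of local and global sampling, and the column-index wrap-around are assumptions for illustration.

```python
# Minimal sketch of path dropping at one join (connection) layer.
import random
import torch

def join(inputs, training, drop_prob=0.15, global_col=None):
    """inputs: list of feature blocks (same shape) entering one join layer."""
    if not training:
        return torch.stack(inputs).mean(0)
    if global_col is not None:
        # global sampling: a single column is chosen for the entire network;
        # the modulo wrap-around is a simplification for joins with fewer inputs
        kept = [inputs[global_col % len(inputs)]]
    else:
        # local sampling: drop each input with a fixed probability,
        # but guarantee that at least one input is retained
        kept = [x for x in inputs if random.random() > drop_prob]
        if not kept:
            kept = [random.choice(inputs)]
    return torch.stack(kept).mean(0)

# per mini-batch: half local sampling, half a single global column (assumed split)
global_col = random.randrange(3) if random.random() < 0.5 else None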
Preferably, the step S3 of inputting the high-level feature vector into two LSTM unit-based recurrent neural network models is specifically:
the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)    (1-3)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)    (1-4)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)    (1-5)
g_t = φ(W_xg x_t + W_hg h_{t-1} + b_g)    (1-6)
c_t = f_t * c_{t-1} + i_t * g_t    (1-7)
h_t = o_t * φ(c_t)    (1-8)
where σ(x_t) = (1 + e^{-x_t})^{-1} is the sigmoid nonlinear activation function and φ(x_t) = (e^{x_t} - e^{-x_t}) / (e^{x_t} + e^{-x_t}) is the hyperbolic tangent nonlinear activation function; i_t, f_t, o_t and c_t respectively represent the state quantities of the input gate, memory gate, output gate and core gate at time t; for each logic gate, W_xi, W_xf, W_xo and W_xg respectively represent the weight transfer matrices applied to the input for the input gate, memory gate, output gate and core gate; W_hi, W_hf, W_ho and W_hg respectively represent the weight transfer matrices applied to the hidden-layer variable h_{t-1} of the input gate, memory gate, output gate and core gate at time t-1; b_i, b_f, b_o and b_g respectively represent the corresponding bias vectors of the input gate, memory gate, output gate and core gate.
Preferably, the neural network model structure in step S3 is:
based on the recurrent neural network structure diagram of the two layers of LSTM units, the recurrent neural network of the LSTM unit stacked by two layers is utilized to carry out the operation of coding and decoding the input feature vector, thereby realizing the conversion of the natural language text; the first layer of LSTM neurons finish the coding process of the input visual feature vector at each moment, and then the hidden layer expression output at each moment is used as the input of the second layer of LSTM neurons; when the feature vectors of all video frames are input into the first layer of LSTM neurons, the second layer of LSTM neurons can receive an indicator and start a decoding task; in the decoding stage, the network has information loss, so the goal of model parameter training and learning is to maximize the log-likelihood function of the whole output statement prediction on the premise of giving hidden layer expression and output prediction at the previous moment; for output sentence Y with parameter θ1,y2,…,ym) The model, the parametric optimization objective, can be expressed as:
here, θ is a parameter, Y represents an output prediction statement, h is a hidden layer expression, the objective function is optimized by using a random gradient descent method, and errors of the whole network are cumulatively transferred in a time dimension by a back propagation algorithm.
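The objective (1-9) can be illustrated with the following minimal sketch, which minimizes the negative log-likelihood of one target sentence with stochastic gradient descent; the model interface, vocabulary size and optimizer settings are assumptions.

```python
# Minimal sketch of the training objective: maximize the sum of word-level
# log-likelihoods, i.e. minimize the negative log-likelihood; gradients
# propagate through time via autograd (back-propagation through time).
import torch
import torch.nn.functional as F

def sentence_nll(logits, target_ids):
    # logits: (seq_len, vocab_size) decoder outputs; target_ids: (seq_len,)
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.gather(1, target_ids.unsqueeze(1)).sum()

# assumed usage with a hypothetical `model` producing decoder logits:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# loss = sentence_nll(model(features), target_ids); loss.backward(); optimizer.step()
```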
Preferably, in step S4, the specific operation of carrying out a weighted average of the output values of the two independent neural network models at each moment and obtaining the description sentence corresponding to the video is:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, calculating the occurrence probability of each word in the vocabulary V by adopting a softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y'∈V} exp(W_{y'} z_t)    (1-10)
where y denotes the predicted word, z_t denotes the output value of the recurrent neural network at time t, and W_y denotes the weight value of the word in the vocabulary.
And S4.3, at each decoding moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence; a minimal sketch of step S4 follows.
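The sketch below illustrates one decoding moment of step S4; the 0.5/0.5 fusion weights and the vocabulary projection matrix W_y are assumptions for illustration.

```python
# Minimal sketch of S4.1-S4.3 for a single decoding moment t.
import torch
import torch.nn.functional as F

def decode_step(z_rgb, z_flow, W_y, w_rgb=0.5, w_flow=0.5):
    # S4.1: weighted average of the two independent models' outputs at time t
    z_t = w_rgb * z_rgb + w_flow * z_flow
    # S4.2: probability of every word in the vocabulary V, as in (1-10)
    probs = F.softmax(W_y @ z_t, dim=0)
    # S4.3: greedy choice of the most probable word at this decoding moment
    return int(torch.argmax(probs))
```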
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The optical flow features added by the invention compensate for the dynamic information inevitably lost by frame sampling, so that changes of the video in both the spatial dimension and the time dimension are taken into account.
2. By processing an arbitrary input video, the video description method based on the two-way fractal network and LSTM automatically generates a descriptive sentence about the video content end to end, and can be applied in fields such as video retrieval, video surveillance and human-computer interaction.
3. The invention obtains abstract visual feature expressions of the low-level features through a novel fractal network, so that the people, objects, behaviors and spatial position relations involved in the video can be analyzed and mined more accurately.
Drawings
Fig. 1 is a flow framework diagram of a two-way fractal network and LSTM based video description method provided by the present invention;
FIG. 2 is a schematic diagram of a fractal subnetwork used in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM unit based recurrent neural network employed by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The method comprises the steps of sampling key frames of a video to be described, extracting optical flow characteristics between two adjacent frames of an original video, learning and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics through two fractal networks, inputting the high-level characteristic expressions into two recurrent neural network models based on an LSTM unit, and performing weighted average on output values of the two independent recurrent neural network models at each moment to obtain a description sentence corresponding to the video.
FIG. 1 is an overall flow chart of the present invention, comprising the steps of:
(1) sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video; wherein, the specific operation of extracting the optical flow characteristic of the video to be described is as follows:
1. respectively calculating the optical flow values in the x direction and the y direction of every two adjacent frames of the video, and normalizing them to the pixel range [0, 255];
2. calculating the magnitude value of the optical flow, and combining it with the optical flow values obtained in the last step to form an optical flow graph.
(2) And respectively learning and obtaining high-level feature expressions of video frames and optical flow features through two fractal networks. Sequentially inputting the sampling frames of the video obtained in the first step to a fractal network for processing the spatial dimension relation in the order of time points, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network; and sequentially inputting the obtained light flow diagrams to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
The fractal network is mainly characterized in that a self-similarity-based design strategy is introduced on a macroscopic framework of a neural network, an extremely deep network is generated through repeated application of a single expansion rule, and the structural layout of the fractal network is a truncated fractal. The network comprises interacting sub-paths of different lengths but does not comprise any through connections. Meanwhile, in order to realize the capability of extracting high-performance fixed-depth sub-networks, a path rejection method is adopted to regularize the cooperative adaptation of sub-paths in the fractal architecture. For fractal networks, the simplicity of training corresponds to the simplicity of the design, with a single loss function connected to the last layer sufficient to drive internal behavior to mimic deep supervision. The fractal network adopted in the invention is a deep convolutional neural network based on a fractal structure.
As shown in FIG. 2, which is a schematic diagram of the fractal structure, the basic case f_1(z) comprises a single layer of the selected type between input and output; let C denote the index of the truncated fractal f_C, where f_C(·) defines the network architecture, connections, and layer types. The basic case is a network containing a single convolutional layer, as in equation (1-1):
f_1(z) = conv(z)    (1-1)
Successive fractal structures are then defined recursively, as in equation (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)    (1-2)
In equation (1-2), ∘ denotes composition and ⊕ denotes the connection (join) operation; C corresponds to the number of columns, i.e. the width, of network f_C(·); depth is defined as the number of conv layers on the longest path from input to output and is proportional to 2^{C-1}. Convolutional networks for classification typically intersperse aggregation (pooling) layers; for the same purpose, f_C is used as a building unit and stacked with a subsequent aggregation layer B times, giving a total depth of B·2^{C-1}. The connection operation ⊕ combines two feature blocks into one; a feature block is the result of one convolutional layer: a tensor of activations held over a fixed number of channels within a spatial region. The number of channels corresponds to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent connections merge into a single connection layer. As shown on the right side of FIG. 2, this connection layer spans multiple columns, merging all of its input feature blocks into a single output block.
Since fractal networks contain additional large scale structures, it is proposed to use a coarse-grained regularization strategy like dropout and drop-connect. Path dropping inhibits the common adaptation of parallel paths by randomly dropping operands at the link layer, which effectively prevents the network from using one path as an anchor and another path as a correction that may cause overfitting behavior. Here, two sampling strategies are mainly used:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, each path is chosen for the entire network, and by restricting this path to being single-column, each column is motivated to be a powerful predictor.
(3) And respectively inputting the high-level feature vectors obtained in the last step into two recurrent neural networks based on the LSTM unit. The recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)    (1-3)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)    (1-4)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)    (1-5)
g_t = φ(W_xg x_t + W_hg h_{t-1} + b_g)    (1-6)
c_t = f_t * c_{t-1} + i_t * g_t    (1-7)
h_t = o_t * φ(c_t)    (1-8)
where σ(x_t) = (1 + e^{-x_t})^{-1} is the sigmoid nonlinear activation function and φ(x_t) = (e^{x_t} - e^{-x_t}) / (e^{x_t} + e^{-x_t}) is the hyperbolic tangent nonlinear activation function; i_t, f_t, o_t and c_t respectively represent the state quantities of the input gate, memory gate, output gate and core gate at time t; for each logic gate, W_xi, W_xf, W_xo and W_xg respectively represent the weight transfer matrices applied to the input for the input gate, memory gate, output gate and core gate; W_hi, W_hf, W_ho and W_hg respectively represent the weight transfer matrices applied to the hidden-layer variable h_{t-1} of the input gate, memory gate, output gate and core gate at time t-1; b_i, b_f, b_o and b_g respectively represent the corresponding bias vectors of the input gate, memory gate, output gate and core gate.
As shown in FIG. 3, the recurrent neural network structure based on two layers of LSTM units is used to encode and decode the input feature vectors, so as to realize the conversion into natural language text. At each moment the first layer of LSTM neurons completes the encoding of the input visual feature vector, and the hidden-layer expression output at each moment is used as the input of the second layer of LSTM neurons; when all the feature vectors of the video frames have been input to the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and begins the decoding task. In the decoding stage the network has information loss, so the goal of model parameter training and learning is to maximize the log-likelihood of the whole output-sentence prediction given the hidden-layer expression and the output prediction of the previous moment. For an output sentence Y = (y_1, y_2, …, y_m) and a model with parameter θ, the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)    (1-9)
where θ is the parameter, Y represents the output prediction sentence and h is the hidden-layer expression; the objective function is optimized by a stochastic gradient descent method, and the errors of the whole network are accumulated and transferred along the time dimension by the back-propagation algorithm.
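The encode-then-decode schedule described above can be illustrated with the following minimal sketch built from torch.nn.LSTMCell; the begin-of-sentence indicator token, the zero padding fed to the second layer during encoding and to the first layer during decoding, and the greedy word selection are implementation assumptions.

```python
# Minimal sketch of the two-layer LSTM encoder-decoder over fractal-network features.
import torch
import torch.nn as nn

class TwoLayerCaptioner(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=1000, bos_id=1):
        super().__init__()
        self.l1 = nn.LSTMCell(feat_dim, hidden)           # first layer: visual encoding
        self.l2 = nn.LSTMCell(hidden + hidden, hidden)     # second layer: language decoding
        self.embed = nn.Embedding(vocab_size, hidden)
        self.W_y = nn.Linear(hidden, vocab_size)
        self.bos_id = bos_id

    def forward(self, feats, max_words=20):
        # feats: (T, feat_dim) features of the sampled frames from one fractal network
        h1 = c1 = h2 = c2 = torch.zeros(1, self.l1.hidden_size)
        pad = torch.zeros(1, self.l1.hidden_size)
        # encoding: layer 1 reads one visual feature vector per moment;
        # its hidden expression (plus padding) is fed to layer 2
        for t in range(feats.size(0)):
            h1, c1 = self.l1(feats[t:t + 1], (h1, c1))
            h2, c2 = self.l2(torch.cat([h1, pad], dim=1), (h2, c2))
        # decoding: layer 2 receives the indicator, then its own previous word
        word = torch.full((1,), self.bos_id, dtype=torch.long)
        words = []
        for _ in range(max_words):
            h1, c1 = self.l1(torch.zeros_like(feats[:1]), (h1, c1))
            h2, c2 = self.l2(torch.cat([h1, self.embed(word)], dim=1), (h2, c2))
            word = self.W_y(h2).argmax(dim=1)   # greedy choice at this moment
            words.append(int(word))
        return words
```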
(4) Carrying out weighted average on output values of two independent models at each moment and obtaining a description sentence corresponding to a video, wherein the specific operation is as follows:
1. carrying out weighted average on output values of the second layer of LSTM neurons at each moment of the two independent models;
2. the probability of occurrence of each word in the vocabulary V is calculated using the softmax function, expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y'∈V} exp(W_{y'} z_t)    (1-10)
where y denotes the predicted word, z_t denotes the output value of the recurrent neural network at time t, and W_y denotes the weight value of the word in the vocabulary.
3. In the decoding stage at each moment, the word with the maximum probability in the output value of the softmax function is taken, so that the corresponding video description sentence is obtained.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (9)

1. A video description method based on two-way fractal network and LSTM is characterized in that firstly, sampling key frames of a video to be described is carried out, optical flow characteristics between two adjacent frames of an original video are extracted, then the two fractal networks are used for learning respectively and obtaining high-level characteristic expressions of the key frames and the optical flow characteristics, then the high-level characteristic expressions are respectively input into two recurrent neural network models based on an LSTM unit, and finally, weighted average is carried out on output values of the two independent recurrent neural network models at each moment, so that description sentences corresponding to the video are obtained; the method specifically comprises the following steps:
s1, sampling key frames of a video to be described, and extracting optical flow characteristics between two adjacent frames of an original video;
s2, learning and obtaining high-level feature expressions of key frames and optical flow features through two fractal networks respectively; wherein the fractal network is generated by repeated application of a single expansion rule;
s3, respectively inputting the high-level feature vectors obtained in the previous step into two recurrent neural network models based on an LSTM unit;
and S4, carrying out weighted average on the output values of the two independent recurrent neural network models at each moment and obtaining the description sentences corresponding to the video.
2. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S1 of extracting the optical flow features from the video to be described specifically includes:
s1.1, respectively calculating optical flow characteristic values in the x direction and the y direction of every two adjacent frames of the video, and normalizing to a pixel range of [0,255 ];
and S1.2, calculating the amplitude value of the optical flow, and combining the optical flow characteristic values obtained in the last step to form an optical flow graph.
3. The video description method based on two-way fractal network and LSTM as claimed in claim 1, wherein the specific steps of obtaining the high-level feature expression of the key frame and optical flow features in step S2 are as follows:
s2.1, sequentially inputting the key frames of the video obtained in the step S1 to a fractal network for processing the spatial dimension relation in a time point sequence, and sequentially generating corresponding visual feature vectors through the nonlinear mapping relation of the network;
and S2.2, sequentially inputting the optical flow graph obtained in the step S1 to a second fractal network for processing the time dimension relation in the time point sequence, and sequentially generating corresponding motion characteristic vectors through the nonlinear mapping relation of the network.
4. A video description method based on two-way fractal network and LSTM according to claim 3, characterized in that the repeated application of the single expansion rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal; the network comprises interacting sub-paths of different lengths but does not comprise any pass-through connections; meanwhile, in order to obtain the capability of extracting high-performance fixed-depth sub-networks, a path-dropping method is adopted to regularize the cooperative adaptation of the sub-paths in the fractal architecture; for fractal networks, the simplicity of training corresponds to the simplicity of design, and a single loss function connected to the last layer is sufficient to drive internal behavior that mimics deep supervision; the adopted fractal network is a deep convolutional neural network based on a fractal structure.
5. The method for describing videos based on two-way fractal network and LSTM according to claim 4, wherein the repeated application of the single extension rule in steps S2.1 and S2.2 generates an extremely deep network whose structural layout is a truncated fractal, specifically:
the basic case f_1(z) comprises a single layer of the selected type between input and output; let C denote the index of the truncated fractal f_C, where f_C(·) defines the network architecture, connections, and layer types; wherein the basic case is a network containing a single convolutional layer, as in equation (1-1):
f_1(z) = conv(z)    (1-1)
successive fractals are then defined recursively, as in equation (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)    (1-2)
in equation (1-2), ∘ denotes composition and ⊕ denotes the connection (join) operation; C corresponds to the number of columns, i.e. the width, of network f_C(·); depth is defined as the number of conv layers on the longest path from input to output and is proportional to 2^{C-1}; convolutional networks for classification typically intersperse aggregation (pooling) layers; for the same purpose, f_C is used as a building unit and stacked with a subsequent aggregation layer B times, giving a total depth of B·2^{C-1}; the connection operation ⊕ combines two feature blocks into one; a feature block is the result of a conv layer: a tensor of activations held over a fixed number of channels within a spatial region; the number of channels corresponds to the number of filters of the preceding conv layer; when the fractal is expanded, adjacent connections are merged into a single connection layer; the connection layer merges all of its input feature blocks into a single output block.
6. The video description method according to claim 4, wherein the rule for regularizing the cooperative adaptation of the sub-paths in the fractal architecture by the path-dropping method in steps S2.1 and S2.2 is specifically: because the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used; path dropping prevents the co-adaptation of parallel paths by randomly discarding operands of the connection layer, which effectively stops the network from using one path as an anchor and another path as a correction, a behavior that may cause over-fitting; two sampling strategies are employed:
for local, the connectivity layer discards each input with a fixed probability, but guarantees that at least one input is retained;
for global, each path is chosen for the entire network, and by restricting this path to being single-column, each column is motivated to be a powerful predictor.
7. The method for describing a video based on a two-way fractal network and an LSTM according to claim 1, wherein the step S3 of inputting the high-level feature vector to two recurrent neural network models based on LSTM units specifically includes: the recurrent neural network based on the LSTM units includes two layers of LSTM units, the first layer and the second layer respectively include 1000 neurons, wherein the forward propagation process of each LSTM neural unit can be expressed as:
i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)    (1-3)
f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)    (1-4)
o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)    (1-5)
g_t = φ(W_xg x_t + W_hg h_{t-1} + b_g)    (1-6)
c_t = f_t * c_{t-1} + i_t * g_t    (1-7)
h_t = o_t * φ(c_t)    (1-8)
wherein σ(x_t) = (1 + e^{-x_t})^{-1} is the sigmoid nonlinear activation function and φ(x_t) = (e^{x_t} - e^{-x_t}) / (e^{x_t} + e^{-x_t}) is the hyperbolic tangent nonlinear activation function; i_t, f_t, o_t and c_t respectively represent the state quantities of the input gate, memory gate, output gate and core gate at time t; for each logic gate, W_xi, W_xf, W_xo and W_xg respectively represent the weight transfer matrices applied to the input for the input gate, memory gate, output gate and core gate; W_hi, W_hf, W_ho and W_hg respectively represent the weight transfer matrices applied to the hidden-layer variable h_{t-1} of the input gate, memory gate, output gate and core gate at time t-1; b_i, b_f, b_o and b_g respectively represent the corresponding bias vectors of the input gate, memory gate, output gate and core gate.
8. The video description method based on two-way fractal network and LSTM of claim 7, wherein the neural network model structure in step S3 is:
realizing the conversion into natural language text based on a recurrent neural network of two layers of LSTM units; at each moment the first layer of LSTM neurons completes the encoding of the input visual feature vector, and the hidden-layer expression output at each moment is used as the input of the second layer of LSTM neurons; when the feature vectors of all video frames have been input to the first layer of LSTM neurons, the second layer of LSTM neurons receives an indicator and starts the decoding task; in the decoding stage the network has information loss, so the goal of model parameter training and learning is to maximize the log-likelihood of the whole output-sentence prediction given the hidden-layer expression and the output prediction of the previous moment; for an output sentence Y = (y_1, y_2, …, y_m) and a model with parameter θ, the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{(h,Y)} log p(Y | h; θ)    (1-9)
where θ is the parameter, Y represents the output prediction sentence and h is the hidden-layer expression; the objective function is optimized by a stochastic gradient descent method, and the errors of the whole network are accumulated and transferred along the time dimension by the back-propagation algorithm.
9. The method for describing the video based on the two-way fractal network and the LSTM according to claim 1, wherein the step S4 is specifically operated to perform weighted average on the output values of the two neural network independent models at each moment and obtain the description sentence corresponding to the video:
s4.1, carrying out weighted average on output values of second-layer LSTM neurons at each moment of the two independent recurrent neural network models;
s4.2, calculating the occurrence probability of each word in the vocabulary V by adopting a softmax function, wherein the probability is expressed as:
P(y | z_t) = exp(W_y z_t) / Σ_{y'∈V} exp(W_{y'} z_t)    (1-10)
where y denotes the predicted word, z_t denotes the output value of the recurrent neural network at time t, and W_y denotes the weight value of the word in the vocabulary;
and S4.3, in the decoding stage of each moment, taking the word with the maximum probability in the output value of the softmax function, thereby obtaining the corresponding video description sentence.
CN201710111507.8A 2017-02-28 2017-02-28 A kind of video presentation method based on two-way fractal net work and LSTM Pending CN106934352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A CN106934352A (en) 2017-02-28 2017-02-28 A kind of video presentation method based on two-way fractal net work and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710111507.8A CN106934352A (en) 2017-02-28 2017-02-28 A kind of video presentation method based on two-way fractal net work and LSTM

Publications (1)

Publication Number Publication Date
CN106934352A true CN106934352A (en) 2017-07-07

Family

ID=59424160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710111507.8A Pending CN106934352A (en) 2017-02-28 2017-02-28 A kind of video presentation method based on two-way fractal net work and LSTM

Country Status (1)

Country Link
CN (1) CN106934352A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN110008789A (en) * 2018-01-05 2019-07-12 中国移动通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279495A (en) * 2015-10-23 2016-01-27 天津大学 Video description method based on deep learning and text summarization
CN106407649A (en) * 2016-08-26 2017-02-15 中国矿业大学(北京) Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GUSTAV LARSSON ET AL.: "FractalNet: Ultra-Deep Neural Networks without Residuals", arXiv:1605.07648v2 *
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets: Deep Networks for Video Classification", IEEE *
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", arXiv:1406.2199v2 *
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", arXiv:1604.01729v1 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460812A (en) * 2017-09-06 2019-03-12 富士通株式会社 Average information analytical equipment, the optimization device, feature visualization device of neural network
CN110019952B (en) * 2017-09-30 2023-04-18 华为技术有限公司 Video description method, system and device
CN110019952A (en) * 2017-09-30 2019-07-16 华为技术有限公司 Video presentation method, system and device
CN107644519A (en) * 2017-10-09 2018-01-30 中电科新型智慧城市研究院有限公司 A kind of intelligent alarm method and system based on video human Activity recognition
CN107909014A (en) * 2017-10-31 2018-04-13 天津大学 A kind of video understanding method based on deep learning
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 A kind of image Chinese subtitle generation method
CN108235116A (en) * 2017-12-27 2018-06-29 北京市商汤科技开发有限公司 Feature propagation method and device, electronic equipment, program and medium
CN108235116B (en) * 2017-12-27 2020-06-16 北京市商汤科技开发有限公司 Feature propagation method and apparatus, electronic device, and medium
CN110008789A (en) * 2018-01-05 2019-07-12 中国移动通信有限公司研究院 Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium
CN108198202A (en) * 2018-01-23 2018-06-22 北京易智能科技有限公司 A kind of video content detection method based on light stream and neural network
CN108470212A (en) * 2018-01-31 2018-08-31 江苏大学 A kind of efficient LSTM design methods that can utilize incident duration
CN108470212B (en) * 2018-01-31 2020-02-21 江苏大学 Efficient LSTM design method capable of utilizing event duration
CN108536735A (en) * 2018-03-05 2018-09-14 中国科学院自动化研究所 Multi-modal lexical representation method and system based on multichannel self-encoding encoder
CN108536735B (en) * 2018-03-05 2020-12-15 中国科学院自动化研究所 Multi-mode vocabulary representation method and system based on multi-channel self-encoder
CN110475129A (en) * 2018-03-05 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, medium and server
CN108228915A (en) * 2018-03-29 2018-06-29 华南理工大学 A kind of video retrieval method based on deep learning
CN109284682A (en) * 2018-08-21 2019-01-29 南京邮电大学 A kind of gesture identification method and system based on STT-LSTM network
CN109522451B (en) * 2018-12-13 2024-02-27 连尚(新昌)网络科技有限公司 Repeated video detection method and device
CN109522451A (en) * 2018-12-13 2019-03-26 连尚(新昌)网络科技有限公司 Repeat video detecting method and device
CN109785336A (en) * 2018-12-18 2019-05-21 深圳先进技术研究院 Image partition method and device based on multipath convolutional neural networks model
CN109785336B (en) * 2018-12-18 2020-11-27 深圳先进技术研究院 Image segmentation method and device based on multipath convolutional neural network model
CN109753897A (en) * 2018-12-21 2019-05-14 西北工业大学 Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN109753897B (en) * 2018-12-21 2022-05-27 西北工业大学 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN110084259B (en) * 2019-01-10 2022-09-20 谢飞 Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics
CN110084259A (en) * 2019-01-10 2019-08-02 谢飞 A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature
CN111767765A (en) * 2019-04-01 2020-10-13 Oppo广东移动通信有限公司 Video processing method and device, storage medium and electronic equipment
CN110197195B (en) * 2019-04-15 2022-12-23 深圳大学 Novel deep network system and method for behavior recognition
CN110197195A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of novel deep layer network system and method towards Activity recognition
CN110531163A (en) * 2019-04-18 2019-12-03 中国人民解放军国防科技大学 Bus capacitance state monitoring method for suspension chopper of maglev train

Similar Documents

Publication Publication Date Title
CN106934352A (en) A kind of video presentation method based on two-way fractal net work and LSTM
CN111985245B (en) Relationship extraction method and system based on attention cycle gating graph convolution network
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN109934261B (en) Knowledge-driven parameter propagation model and few-sample learning method thereof
CN113487088A (en) Traffic prediction method and device based on dynamic space-time diagram convolution attention model
Liu et al. Time series prediction based on temporal convolutional network
CN116415654A (en) Data processing method and related equipment
CN111914085A (en) Text fine-grained emotion classification method, system, device and storage medium
CN113535953B (en) Meta learning-based few-sample classification method
CN112115744B (en) Point cloud data processing method and device, computer storage medium and electronic equipment
CN112529071B (en) Text classification method, system, computer equipment and storage medium
CN109583659A (en) User's operation behavior prediction method and system based on deep learning
CN114925205B (en) GCN-GRU text classification method based on contrast learning
Khoshraftar et al. Dynamic graph embedding via lstm history tracking
Feng et al. A survey of visual neural networks: current trends, challenges and opportunities
Srinivas et al. A comprehensive survey of techniques, applications, and challenges in deep learning: A revolution in machine learning
CN116663523B (en) Semantic text similarity calculation method for multi-angle enhanced network
CN115761654B (en) Vehicle re-identification method
Zhou et al. What happens next? Combining enhanced multilevel script learning and dual fusion strategies for script event prediction
Zhu A graph neural network-enhanced knowledge graph framework for intelligent analysis of policing cases
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
CN116050523A (en) Attention-directed enhanced common sense reasoning framework based on mixed knowledge graph
Nagrath et al. A comprehensive E-commerce customer behavior analysis using convolutional methods
Mai et al. From Efficient Multimodal Models to World Models: A Survey
CN112528015B (en) Method and device for judging rumor in message interactive transmission

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170707

RJ01 Rejection of invention patent application after publication