CN106934352A - A video description method based on a two-way fractal network and LSTM - Google Patents
A video description method based on a two-way fractal network and LSTM
- Publication number
- CN106934352A CN106934352A CN201710111507.8A CN201710111507A CN106934352A CN 106934352 A CN106934352 A CN 106934352A CN 201710111507 A CN201710111507 A CN 201710111507A CN 106934352 A CN106934352 A CN 106934352A
- Authority
- CN
- China
- Prior art keywords
- network
- video
- fractal
- lstm
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video description method based on a two-way fractal network and LSTM. The method first samples key frames from the video to be described and extracts the optical-flow features between every two adjacent frames of the original video; two fractal networks then learn and obtain high-level feature representations of the video frames and of the optical-flow features, which are fed respectively into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent models at each time step are combined by a weighted average to obtain the description sentence corresponding to the video. The invention uses both the original video frames and the optical flow of the video to be described: the added optical-flow features compensate for the dynamic information that is inevitably lost by frame sampling, so that changes of the video in both the spatial and the temporal dimension are taken into account. Furthermore, a novel fractal network turns the low-level features into abstract visual feature representations, so that the people, objects, behaviors and spatial relationships involved in the video can be analyzed and mined more accurately.
Description
Technical Field
The present invention belongs to the technical fields of video description and deep learning, and in particular relates to a video description method based on a two-way fractal network and LSTM.
Background Art
With the progress of science and technology and the development of society, video-capturing terminals of all kinds, especially smart phones, have become ubiquitous, and the price of hardware storage keeps falling, so that multimedia information streams are growing exponentially. Faced with such massive video streams, how to analyze, recognize and understand them efficiently and automatically with as little human intervention as possible, and thereby describe them semantically, has become a hot topic in current image processing and computer vision research. For most people, watching a short video and then describing it in words may be a simple matter; for a machine, however, extracting the pixel information of each frame of the video, analyzing and processing it, and generating a natural-language sentence that describes it is a challenging task.
Enabling machines to describe videos efficiently and automatically also has broad application prospects in computer vision fields such as video retrieval, human-computer interaction and traffic security, which will further promote research on the semantic description of video.
Summary of the Invention
The main purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide a video description method based on a two-way fractal network and LSTM.
In order to achieve the above object, the present invention adopts the following technical solution:
A video description method based on a two-way fractal network and LSTM, characterized in that key frames are first sampled from the video to be described and optical-flow features are extracted between every two adjacent frames of the original video; two fractal networks then learn and obtain high-level feature representations of the key frames and of the optical-flow features, which are fed respectively into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are combined by a weighted average to obtain the description sentence corresponding to the video. The method specifically comprises the following steps:
S1. Sampling key frames from the video to be described, and extracting optical-flow features between every two adjacent frames of the original video;
S2. Learning and obtaining high-level feature representations of the video frames and of the optical-flow features through two fractal networks, respectively;
S3. Feeding the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, respectively;
S4. Performing a weighted average of the outputs of the two independent models at each time step to obtain the description sentence corresponding to the video.
Preferably, extracting the optical-flow features of the video to be described in step S1 specifically comprises:
S1.1. Computing the optical-flow values in the x and y directions for every two adjacent frames of the video, and normalizing them to the pixel range [0, 255];
S1.2. Computing the magnitude of the optical flow and combining it with the optical-flow values obtained in the previous step into an optical-flow image (see the sketch after this list).
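As a concrete illustration of steps S1.1 and S1.2, the following minimal sketch builds one optical-flow image per pair of adjacent frames with OpenCV. It is a sketch under stated assumptions rather than the patented implementation: the patent does not name an optical-flow estimator, so the Farnebäck method, its parameters and the placeholder path `video.mp4` are choices made here for illustration.

```python
import cv2
import numpy as np

def flow_image(prev_gray, next_gray):
    """Return an H x W x 3 image holding x-flow, y-flow and flow magnitude, each rescaled to [0, 255]."""
    # Dense Farneback optical flow; the estimator and its parameters are assumptions, not fixed by the text.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.sqrt(fx ** 2 + fy ** 2)
    channels = [cv2.normalize(c, None, 0, 255, cv2.NORM_MINMAX) for c in (fx, fy, mag)]
    return np.stack(channels, axis=-1).astype(np.uint8)

cap = cv2.VideoCapture("video.mp4")      # placeholder path to the video to be described
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
flow_images = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow_images.append(flow_image(prev_gray, gray))   # one flow image per pair of adjacent frames
    prev_gray = gray
cap.release()
```

Stacking the x-flow, y-flow and magnitude as three channels lets the flow images be fed to the second fractal network in the same way as ordinary RGB frames.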
Preferably, obtaining the high-level feature representations of the key frames and of the optical-flow features in step S2 specifically comprises:
S2.1. Feeding the key frames of the video obtained in step S1, in the order of their time points, into the first fractal network, which handles the spatial relationships; the corresponding visual feature vectors are generated one after another through the nonlinear mapping of the network;
S2.2. Feeding the optical-flow images obtained in step S1, in the order of their time points, into the second fractal network, which handles the temporal relationships; the corresponding motion feature vectors are generated one after another through the nonlinear mapping of the network (see the sketch after this list).
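The per-time-step extraction of S2.1 and S2.2 can be pictured as below: each sampled frame and each flow image is pushed through its own network in temporal order, yielding one feature vector per time step. The tiny convolutional stand-in networks are assumptions made purely so the sketch runs; the actual networks are the fractal networks defined in the following paragraphs.

```python
import torch
import torch.nn as nn

# Stand-in feature extractors; in the method these are the two fractal networks described below.
def make_extractor(out_dim=500):
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, out_dim),
    )

spatial_net = make_extractor()    # handles the sampled key frames (spatial relationships)
temporal_net = make_extractor()   # handles the optical-flow images (temporal relationships)

frames = torch.rand(20, 3, 224, 224)   # sampled key frames, oldest first (placeholder data)
flows = torch.rand(19, 3, 224, 224)    # optical-flow images between adjacent frames

with torch.no_grad():
    visual_feats = [spatial_net(f.unsqueeze(0)).squeeze(0) for f in frames]   # one vector per key frame
    motion_feats = [temporal_net(f.unsqueeze(0)).squeeze(0) for f in flows]   # one vector per flow image
```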
Preferably, in steps S2.1 and S2.2, repeated application of a single expansion rule generates an extremely deep network whose structural layout is a truncated fractal. The network contains interacting sub-paths of different lengths but no pass-through connections. At the same time, in order to be able to extract high-performing fixed-depth sub-networks, a path-drop method is used to regularize the co-adaptation of sub-paths within the fractal architecture. For fractal networks, simplicity of training matches simplicity of design: a single loss function attached to the final layer is sufficient to drive the internal behavior to mimic deep supervision. The fractal network used is a deep convolutional neural network based on a fractal structure.
Preferably, in steps S2.1 and S2.2, the generation of an extremely deep network whose structural layout is a truncated fractal by repeated application of a single expansion rule is specifically as follows:
The base case f_1(z) contains a single layer of a selected type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network consisting of a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)    (1-1)
The subsequent fractals are then defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)    (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, i.e. the width of the network f_C(·). The depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^(C-1). Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with a subsequent pooling layer, giving a total depth of B·2^(C-1). The join operation merges two feature blocks into one; a feature block is the output of a conv layer, namely a tensor of activations maintained for a fixed number of channels over a spatial region, and the number of channels corresponds to the number of filters of the preceding conv layer. When the fractal is expanded, adjacent joins are merged into a single join layer, and a join layer merges all of its input feature blocks into a single output block.
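The expansion rule of formulas (1-1) and (1-2) translates almost literally into code: f_1 is a single convolution, and f_{C+1} joins two stacked copies of f_C with a parallel convolution. The PyTorch sketch below follows that recursion; the channel count, the batch-norm/ReLU layers and the use of an element-wise mean as the join are assumptions for illustration, not details fixed by the text.

```python
import torch
import torch.nn as nn

class Fractal(nn.Module):
    """f_C(.) built by the recursion of (1-1)/(1-2): f_1 = conv, f_{C+1} = join(f_C o f_C, conv)."""
    def __init__(self, channels, C):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        # the long path: two stacked copies of the shallower fractal (f_C composed with f_C)
        self.sub = nn.Sequential(Fractal(channels, C - 1), Fractal(channels, C - 1)) if C > 1 else None

    def forward(self, z):
        paths = [self.conv(z)]
        if self.sub is not None:
            paths.append(self.sub(z))
        # join layer: here an element-wise mean of the parallel paths (an assumption)
        return torch.stack(paths).mean(dim=0)

block = Fractal(channels=16, C=3)   # longest path has 2**(C-1) = 4 conv layers, shortest has 1
y = block(torch.rand(1, 16, 32, 32))
```

For C = 3 the longest path already contains 2^(C-1) = 4 convolutions while the shortest path is a single convolution, which is exactly the mix of sub-path lengths described above.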
Preferably, the path-drop method that regularizes the co-adaptation of sub-paths within the fractal architecture in steps S2.1 and S2.2 is specifically as follows: since the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is used. Path drop forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers; this effectively prevents the over-fitting behavior that could arise if the network used one path as an anchor and another path as a correction. Two sampling strategies are adopted:
Local: a join layer drops each of its inputs with a fixed probability, but it is guaranteed that at least one input is kept;
Global: a single path is selected for the entire network, and restricting this path to a single column encourages each column to become a strong predictor on its own (the sketch after this list illustrates the local rule).
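A minimal way to realize the local rule is inside the join layer itself: during training each incoming path survives with a fixed probability, at least one path is always kept, and the survivors are averaged. The keep probability of 0.85 below is an assumption; the global single-column sampling mode is not shown.

```python
import random
import torch

def join_with_drop_path(paths, keep_prob=0.85, training=True):
    """Average the parallel paths, randomly dropping join inputs during training (local sampling)."""
    if training:
        kept = [p for p in paths if random.random() < keep_prob]
        if not kept:                       # at least one input is always kept
            kept = [random.choice(paths)]
        paths = kept
    return torch.stack(paths).mean(dim=0)

a, b, c = (torch.rand(1, 16, 32, 32) for _ in range(3))
out = join_with_drop_path([a, b, c])
```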
Preferably, feeding the high-level feature vectors into the two recurrent neural network models based on LSTM units in step S3 is specifically as follows:
Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons, and the forward propagation of each LSTM unit can be expressed as:
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i)    (1-3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + b_f)    (1-4)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + b_o)    (1-5)
g_t = φ(W_xg·x_t + W_hg·h_{t-1} + b_g)    (1-6)
c_t = f_t * c_{t-1} + i_t * g_t    (1-7)
where σ(x) = (1 + e^(-x))^(-1) is the sigmoid nonlinear activation function and φ is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states of the input gate, memory gate, output gate and core gate at time t; for each logic gate, W_xi, W_xf, W_xo and W_xg denote the weight transfer matrices applied to the input for the input, memory, output and core gates respectively, W_hi, W_hf, W_ho and W_hg denote the weight transfer matrices applied to the hidden-layer variable h_{t-1} at time t-1 for the input, memory, output and core gates respectively, and b_i, b_f, b_o and b_g denote the corresponding bias vectors.
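A direct transcription of this forward pass is given below. Equations (1-3) to (1-7) are implemented literally; the hidden-state update h_t = o_t * tanh(c_t), the random initialization and the helper structure are standard-LSTM assumptions added so the sketch runs, not details spelled out in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """One LSTM unit following equations (1-3) to (1-7), plus the standard hidden-state update."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def w(*shape):
            return rng.normal(0.0, 0.1, shape)

        # W_x* act on the input x_t, W_h* act on the previous hidden state h_{t-1}
        self.Wxi, self.Whi, self.bi = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wxf, self.Whf, self.bf = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wxo, self.Who, self.bo = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)
        self.Wxg, self.Whg, self.bg = w(hidden_dim, input_dim), w(hidden_dim, hidden_dim), np.zeros(hidden_dim)

    def step(self, x_t, h_prev, c_prev):
        i_t = sigmoid(self.Wxi @ x_t + self.Whi @ h_prev + self.bi)   # (1-3) input gate
        f_t = sigmoid(self.Wxf @ x_t + self.Whf @ h_prev + self.bf)   # (1-4) memory gate
        o_t = sigmoid(self.Wxo @ x_t + self.Who @ h_prev + self.bo)   # (1-5) output gate
        g_t = np.tanh(self.Wxg @ x_t + self.Whg @ h_prev + self.bg)   # (1-6) core gate
        c_t = f_t * c_prev + i_t * g_t                                # (1-7) cell state
        h_t = o_t * np.tanh(c_t)                                      # hidden state (standard LSTM, not numbered above)
        return h_t, c_t

cell = LSTMCell(input_dim=500, hidden_dim=1000)
h = c = np.zeros(1000)
for x_t in np.random.rand(5, 500):       # five time steps of input feature vectors
    h, c = cell.step(x_t, h, c)
```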
Preferably, the structure of the neural network model in step S3 is as follows:
With reference to the structure of the recurrent neural network built from two layers of LSTM units, this two-layer stack of LSTM units performs the encoding and decoding of the input feature vectors, thereby realizing the conversion into natural-language text. The first LSTM layer carries out the encoding of the input visual feature vector at each time step, and the hidden-layer representation output at each time step serves as the input of the second LSTM layer; once the feature vectors of all video frames have been fed into the first LSTM layer, the second LSTM layer receives an indicator and begins the decoding task. During decoding the network loses information, so the goal of model parameter training and learning is to maximize, given the hidden-layer representation and the output prediction of the previous time step, the log-likelihood of the whole predicted output sentence. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{t=1}^{m} log p(y_t | h, y_{t-1}; θ)
Here θ denotes the parameters, Y the predicted output sentence and h the hidden-layer representation. The objective function is optimized with the stochastic gradient descent method, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm.
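The objective is simply the summed log-probability of the reference words, so it can be evaluated in a few lines. The sketch below uses random numbers in place of the real hidden-layer outputs and word weights; the vocabulary size, dimensions and softmax parameterization are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, m, d = 50, 6, 100                          # toy vocabulary size, sentence length, hidden size
W = rng.normal(0.0, 0.1, (V, d))              # word weight vectors W_y (one row per word)
z = rng.normal(0.0, 1.0, (m, d))              # hidden-layer outputs z_t at each decoding step
y = rng.integers(0, V, size=m)                # reference words y_1 .. y_m

# log p(Y | h; theta) = sum_t log p(y_t | h, y_{t-1}; theta); training picks theta to maximize this
log_likelihood = sum(float(np.log(softmax(W @ z_t)[y_t])) for z_t, y_t in zip(z, y))
print(log_likelihood)
```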
Preferably, step S4, performing a weighted average of the outputs of the two independent neural network models at each time step and obtaining the description sentence corresponding to the video, specifically comprises:
S4.1. Performing a weighted average of the outputs of the second-layer LSTM neurons of the two independent recurrent neural network models at each time step;
S4.2. Using the softmax function to compute the probability of occurrence of every word in the vocabulary V, expressed as:
p(y | z_t) = exp(W_y·z_t) / Σ_{y'∈V} exp(W_{y'}·z_t)
where y denotes the predicted word, z_t denotes the output of the recurrent neural network at time t, and W_y denotes the weight of that word in the vocabulary;
S4.3. In the decoding phase at each time step, taking the word with the highest probability among the softmax outputs, thereby obtaining the corresponding video description sentence (a minimal fusion-and-decoding sketch follows this list).
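Step S4 can be pictured as a late fusion followed by greedy decoding: at every time step the two streams' outputs are averaged with fixed weights, a softmax gives the distribution over the vocabulary V, and the most probable word is emitted. In the sketch below the scores are random, the toy vocabulary and the 2/3 : 1/3 weighting are assumptions, and decoding stops at an assumed `<eos>` token.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["a", "man", "is", "riding", "a", "horse", "<eos>"]   # toy vocabulary, assumption
rng = np.random.default_rng(0)
sentence = []
for t in range(5):
    z_frame = rng.normal(size=len(vocab))            # score vector from the frame (spatial) model at step t
    z_flow = rng.normal(size=len(vocab))             # score vector from the optical-flow (temporal) model
    z_fused = (2 / 3) * z_frame + (1 / 3) * z_flow   # weighted average of the two streams
    p = softmax(z_fused)                             # p(y | z_t) over the vocabulary
    word = vocab[int(np.argmax(p))]                  # greedy choice: most probable word
    if word == "<eos>":
        break
    sentence.append(word)
print(" ".join(sentence))
```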
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The optical-flow features added by the present invention compensate for the dynamic information that is inevitably lost by frame sampling, taking into account the changes of the video in both the spatial and the temporal dimension.
2. The video description method based on a two-way fractal network and LSTM provided by the present invention can, by processing an arbitrary input video, automatically generate a descriptive sentence about the video content end to end, and can be applied in fields such as video retrieval, video surveillance and human-computer interaction.
3. The present invention expresses the low-level features as abstract visual features through a novel fractal network, so that the people, objects, behaviors and spatial relationships involved in the video can be analyzed and mined more accurately.
Brief Description of the Drawings
Fig. 1 is a flow chart of the video description method based on a two-way fractal network and LSTM provided by the present invention;
Fig. 2 is a schematic diagram of the fractal sub-network used in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the recurrent neural network based on LSTM units used in an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Key frames are sampled from the video to be described and optical-flow features are extracted between every two adjacent frames of the original video; two fractal networks then learn and obtain high-level feature representations of the key frames and of the optical-flow features, which are fed respectively into two recurrent neural network models based on LSTM units; finally, the outputs of the two independent recurrent neural network models at each time step are combined by a weighted average to obtain the description sentence corresponding to the video.
Fig. 1 is the overall flow chart of the present invention, comprising the following steps:
(1) Sampling key frames from the video to be described and extracting optical-flow features between every two adjacent frames of the original video. Extracting the optical-flow features of the video to be described specifically comprises:
1. Computing the optical-flow values in the x and y directions for every two adjacent frames of the video, and normalizing them to the pixel range [0, 255];
2. Computing the magnitude of the optical flow and combining it with the optical-flow values obtained in the previous step into an optical-flow image.
(2) Learning and obtaining high-level feature representations of the video frames and of the optical-flow features through two fractal networks, respectively. The sampled frames of the video obtained in the first step are fed, in the order of their time points, into the first fractal network, which handles the spatial relationships, and the corresponding visual feature vectors are generated one after another through the nonlinear mapping of the network; the optical-flow images are fed, in the order of their time points, into the second fractal network, which handles the temporal relationships, and the corresponding motion feature vectors are generated one after another through the nonlinear mapping of the network.
The fractal network introduces a self-similarity-based design strategy into the macro-architecture of the neural network: repeated application of a single expansion rule generates an extremely deep network whose structural layout is a truncated fractal. The network contains interacting sub-paths of different lengths but no pass-through connections. At the same time, in order to be able to extract high-performing fixed-depth sub-networks, a path-drop method is used to regularize the co-adaptation of sub-paths within the fractal architecture. For fractal networks, simplicity of training matches simplicity of design: a single loss function attached to the final layer is sufficient to drive the internal behavior to mimic deep supervision. The fractal network used in the present invention is a deep convolutional neural network based on a fractal structure.
Fig. 2 gives a schematic diagram of the fractal structure. The base case f_1(z) contains a single layer of a selected type between input and output. Let C denote the index of the truncated fractal f_C(·); f_C(·) defines the network architecture, its connections and its layer types. The base case, a network consisting of a single convolutional layer, is written as formula (1-1):
f_1(z) = conv(z)    (1-1)
The subsequent fractal structures are then defined recursively by formula (1-2):
f_{C+1}(z) = [f_C ∘ f_C](z) ⊕ conv(z)    (1-2)
In formula (1-2), ∘ denotes composition and ⊕ denotes the join operation; C corresponds to the number of columns, i.e. the width of the network f_C(·). The depth, defined as the number of conv layers on the longest path from input to output, is proportional to 2^(C-1). Convolutional networks used for classification usually intersperse pooling layers; to the same end, f_C(·) is used as a building block and stacked B times with a subsequent pooling layer, giving a total depth of B·2^(C-1). The join operation merges two feature blocks into one, where a feature block is the output of a convolutional layer: a tensor of activations maintained for a fixed number of channels over a spatial region. The number of channels corresponds to the number of filters of the preceding convolutional layer. When the fractal is expanded, adjacent joins are merged into a single join layer; as shown on the right of Fig. 2, this join layer spans multiple columns and merges all of its input feature blocks into a single output block.
Since the fractal network contains additional large-scale structure, a coarse-grained regularization strategy similar to dropout and drop-connect is proposed. Path drop forbids the co-adaptation of parallel paths by randomly discarding operands of the join layers; this effectively prevents the over-fitting behavior that could arise if the network used one path as an anchor and another path as a correction. Two sampling strategies are mainly adopted here:
Local: a join layer drops each of its inputs with a fixed probability, but it is guaranteed that at least one input is kept;
Global: a single path is selected for the entire network, and restricting this path to a single column encourages each column to become a strong predictor on its own.
(3) Feeding the high-level feature vectors obtained in the previous step into two recurrent neural networks based on LSTM units, respectively. Each recurrent neural network based on LSTM units contains two layers of LSTM units, the first and the second layer each containing 1000 neurons, and the forward propagation of each LSTM unit can be expressed as:
i_t = σ(W_xi·x_t + W_hi·h_{t-1} + b_i)    (1-3)
f_t = σ(W_xf·x_t + W_hf·h_{t-1} + b_f)    (1-4)
o_t = σ(W_xo·x_t + W_ho·h_{t-1} + b_o)    (1-5)
g_t = φ(W_xg·x_t + W_hg·h_{t-1} + b_g)    (1-6)
c_t = f_t * c_{t-1} + i_t * g_t    (1-7)
where σ(x) = (1 + e^(-x))^(-1) is the sigmoid nonlinear activation function and φ is the hyperbolic-tangent nonlinear activation function; i_t, f_t, o_t and c_t denote the states of the input gate, memory gate, output gate and core gate at time t; for each logic gate, W_xi, W_xf, W_xo and W_xg denote the weight transfer matrices applied to the input for the input, memory, output and core gates respectively, W_hi, W_hf, W_ho and W_hg denote the weight transfer matrices applied to the hidden-layer variable h_{t-1} at time t-1 for the input, memory, output and core gates respectively, and b_i, b_f, b_o and b_g denote the corresponding bias vectors.
Fig. 3 shows the structure of the recurrent neural network built from two layers of LSTM units. This two-layer stack of LSTM units is used to encode and decode the input feature vectors, thereby realizing the conversion into natural-language text. The first LSTM layer carries out the encoding of the input visual feature vector at each time step, and the hidden-layer representation output at each time step serves as the input of the second LSTM layer; once the feature vectors of all video frames have been fed into the first LSTM layer, the second LSTM layer receives an indicator and begins the decoding task. During decoding the network loses information, so the goal of model parameter training and learning is to maximize, given the hidden-layer representation and the output prediction of the previous time step, the log-likelihood of the whole predicted output sentence. For a model with parameters θ and output sentence Y = (y_1, y_2, …, y_m), the parameter optimization objective can be expressed as:
θ* = argmax_θ Σ_{t=1}^{m} log p(y_t | h, y_{t-1}; θ)
Here θ denotes the parameters, Y the predicted output sentence and h the hidden-layer representation. The objective function is optimized with the stochastic gradient descent method, and the error of the whole network is accumulated and propagated along the time dimension by the back-propagation algorithm; a training sketch is given below.
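One training step of this encode-then-decode scheme is sketched below, with PyTorch's autograd performing the back-propagation through time. The layer sizes, the teacher-forcing decoding, the reuse of index 0 as both padding and start indicator, and the simplification of feeding zero first-layer outputs during decoding are all assumptions made to keep the sketch short; they are not prescribed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, emb_dim, hidden, vocab = 500, 256, 1000, 2000        # assumed sizes

lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)            # layer 1: encodes the visual features
lstm2 = nn.LSTM(hidden + emb_dim, hidden, batch_first=True)    # layer 2: decodes words
embed = nn.Embedding(vocab, emb_dim)                           # index 0 doubles as <pad>/<bos> here
out = nn.Linear(hidden, vocab)
params = [p for m in (lstm1, lstm2, embed, out) for p in m.parameters()]
opt = torch.optim.SGD(params, lr=0.01)                         # stochastic gradient descent, as in the text

feats = torch.rand(1, 12, feat_dim)                  # per-frame feature vectors, 12 time steps
sentence = torch.randint(1, vocab, (1, 7))           # reference description, 7 word indices

# Encoding: layer 1 reads all frame features; layer 2 sees those hidden states plus a padding word.
enc_h, _ = lstm1(feats)
pad_words = embed(torch.zeros(1, feats.size(1), dtype=torch.long))
_, state2 = lstm2(torch.cat([enc_h, pad_words], dim=-1))

# Decoding with teacher forcing: previous reference word plus a (simplified) zero visual input.
prev_words = torch.cat([torch.zeros(1, 1, dtype=torch.long), sentence[:, :-1]], dim=1)
dec_in = torch.cat([torch.zeros(1, sentence.size(1), hidden), embed(prev_words)], dim=-1)
dec_h, _ = lstm2(dec_in, state2)
loss = F.cross_entropy(out(dec_h).squeeze(0), sentence.squeeze(0), reduction="sum")

opt.zero_grad()
loss.backward()   # the error is accumulated and propagated backwards through all time steps (BPTT)
opt.step()
```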
(4) Performing a weighted average of the outputs of the two independent models at each time step and obtaining the description sentence corresponding to the video. The specific operations are:
1. Performing a weighted average of the outputs of the second-layer LSTM neurons of the two independent models at each time step;
2. Using the softmax function to compute the probability of occurrence of every word in the vocabulary V, expressed as:
p(y | z_t) = exp(W_y·z_t) / Σ_{y'∈V} exp(W_{y'}·z_t)
where y denotes the predicted word, z_t denotes the output of the recurrent neural network at time t, and W_y denotes the weight of that word in the vocabulary.
3. In the decoding phase at each time step, taking the word with the highest probability among the softmax outputs, thereby obtaining the corresponding video description sentence.
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the scope of protection of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710111507.8A CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934352A true CN106934352A (en) | 2017-07-07 |
Family
ID=59424160
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710111507.8A Pending CN106934352A (en) | 2017-02-28 | 2017-02-28 | A kind of video presentation method based on two-way fractal net work and LSTM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934352A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A gesture recognition method and system based on STT-LSTM network |
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Behavior recognition method based on memory unit reinforcement-temporal dynamic learning |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method, device, storage medium and electronic device |
CN111814589A (en) * | 2020-06-18 | 2020-10-23 | 浙江大华技术股份有限公司 | Part identification method and related equipment and device |
CN112912888A (en) * | 2018-10-31 | 2021-06-04 | 华为技术有限公司 | Apparatus and method for identifying video activity |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
2017
- 2017-02-28 CN CN201710111507.8A patent/CN106934352A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279495A (en) * | 2015-10-23 | 2016-01-27 | 天津大学 | Video description method based on deep learning and text summarization |
CN106407649A (en) * | 2016-08-26 | 2017-02-15 | 中国矿业大学(北京) | Onset time automatic picking method of microseismic signal on the basis of time-recursive neural network |
Non-Patent Citations (4)
Title |
---|
GUSTAV LARSSON ET AL.: "FractalNet:Ultra-Deep Neural Networks without Residuals", 《ARXIV:1605.07648V2》 * |
JOE YUE-HEI NG ET AL.: "Beyond Short Snippets:Deep Networks for Videos Classification", 《IEEE》 * |
KAREN SIMONYAN ET AL.: "Two-Stream Convolutional Networks for Action Recognition in Videos", 《ARXIV:1406.2199V2》 * |
SUBHASHINI VENUGOPALAN ET AL.: "Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text", 《ARXIV :1604.01729V 1》 * |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460812A (en) * | 2017-09-06 | 2019-03-12 | 富士通株式会社 | Average information analytical equipment, the optimization device, feature visualization device of neural network |
CN110019952B (en) * | 2017-09-30 | 2023-04-18 | 华为技术有限公司 | Video description method, system and device |
CN110019952A (en) * | 2017-09-30 | 2019-07-16 | 华为技术有限公司 | Video presentation method, system and device |
CN107644519A (en) * | 2017-10-09 | 2018-01-30 | 中电科新型智慧城市研究院有限公司 | A kind of intelligent alarm method and system based on video human Activity recognition |
CN107909014A (en) * | 2017-10-31 | 2018-04-13 | 天津大学 | A kind of video understanding method based on deep learning |
CN107909115A (en) * | 2017-12-04 | 2018-04-13 | 上海师范大学 | A kind of image Chinese subtitle generation method |
CN108235116A (en) * | 2017-12-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | Feature propagation method and device, electronic equipment, program and medium |
CN108235116B (en) * | 2017-12-27 | 2020-06-16 | 北京市商汤科技开发有限公司 | Feature propagation method and apparatus, electronic device, and medium |
CN110008789A (en) * | 2018-01-05 | 2019-07-12 | 中国移动通信有限公司研究院 | Multiclass object detection and knowledge method for distinguishing, equipment and computer readable storage medium |
CN108198202A (en) * | 2018-01-23 | 2018-06-22 | 北京易智能科技有限公司 | A kind of video content detection method based on light stream and neural network |
CN108470212A (en) * | 2018-01-31 | 2018-08-31 | 江苏大学 | A kind of efficient LSTM design methods that can utilize incident duration |
CN108470212B (en) * | 2018-01-31 | 2020-02-21 | 江苏大学 | An Efficient LSTM Design Method Using Event Duration |
CN108536735A (en) * | 2018-03-05 | 2018-09-14 | 中国科学院自动化研究所 | Multi-modal lexical representation method and system based on multichannel self-encoding encoder |
CN110475129A (en) * | 2018-03-05 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Method for processing video frequency, medium and server |
CN108536735B (en) * | 2018-03-05 | 2020-12-15 | 中国科学院自动化研究所 | Method and system for multimodal lexical representation based on multichannel autoencoder |
CN108228915A (en) * | 2018-03-29 | 2018-06-29 | 华南理工大学 | A kind of video retrieval method based on deep learning |
CN109284682A (en) * | 2018-08-21 | 2019-01-29 | 南京邮电大学 | A gesture recognition method and system based on STT-LSTM network |
CN112912888A (en) * | 2018-10-31 | 2021-06-04 | 华为技术有限公司 | Apparatus and method for identifying video activity |
CN109522451A (en) * | 2018-12-13 | 2019-03-26 | 连尚(新昌)网络科技有限公司 | Repeat video detecting method and device |
CN109522451B (en) * | 2018-12-13 | 2024-02-27 | 连尚(新昌)网络科技有限公司 | Repeated video detection method and device |
CN109785336A (en) * | 2018-12-18 | 2019-05-21 | 深圳先进技术研究院 | Image partition method and device based on multipath convolutional neural networks model |
CN109785336B (en) * | 2018-12-18 | 2020-11-27 | 深圳先进技术研究院 | Image segmentation method and device based on multi-path convolutional neural network model |
CN109753897A (en) * | 2018-12-21 | 2019-05-14 | 西北工业大学 | Behavior recognition method based on memory unit reinforcement-temporal dynamic learning |
CN109753897B (en) * | 2018-12-21 | 2022-05-27 | 西北工业大学 | Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning |
CN110084259B (en) * | 2019-01-10 | 2022-09-20 | 谢飞 | Facial paralysis grading comprehensive evaluation system combining facial texture and optical flow characteristics |
CN110084259A (en) * | 2019-01-10 | 2019-08-02 | 谢飞 | A kind of facial paralysis hierarchical synthesis assessment system of combination face texture and Optical-flow Feature |
CN111767765A (en) * | 2019-04-01 | 2020-10-13 | Oppo广东移动通信有限公司 | Video processing method, device, storage medium and electronic device |
CN110197195B (en) * | 2019-04-15 | 2022-12-23 | 深圳大学 | Novel deep network system and method for behavior recognition |
CN110197195A (en) * | 2019-04-15 | 2019-09-03 | 深圳大学 | A kind of novel deep layer network system and method towards Activity recognition |
CN110531163A (en) * | 2019-04-18 | 2019-12-03 | 中国人民解放军国防科技大学 | Bus capacitance state monitoring method for suspension chopper of maglev train |
CN111814589A (en) * | 2020-06-18 | 2020-10-23 | 浙江大华技术股份有限公司 | Part identification method and related equipment and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934352A (en) | A kind of video presentation method based on two-way fractal net work and LSTM | |
CN111985245B (en) | Relationship extraction method and system based on attention cycle gating graph convolution network | |
CN113515951B (en) | A story description generation method based on knowledge-augmented attention network and group-level semantics | |
Wang et al. | TRC‐YOLO: A real‐time detection method for lightweight targets based on mobile devices | |
CN108664589A (en) | Text message extracting method, device, system and medium based on domain-adaptive | |
CN110866542A (en) | Depth representation learning method based on feature controllable fusion | |
CN113064968A (en) | Social media emotion analysis method and system based on tensor fusion network | |
CN111597929A (en) | Group Behavior Recognition Method Based on Channel Information Fusion and Group Relationship Spatial Structured Modeling | |
CN117786475A (en) | Dynamic network-based multi-task rumor detection model and method | |
Zhenhua et al. | FTCF: Full temporal cross fusion network for violence detection in videos | |
CN115953902B (en) | A traffic flow prediction method based on multi-view spatiotemporal graph convolutional network | |
CN116663523A (en) | Semantic text similarity calculation method for multi-angle enhanced network | |
Zhao et al. | Human action recognition based on improved fusion attention CNN and RNN | |
Liu et al. | A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining | |
CN110263638A (en) | A kind of video classification methods based on significant information | |
Mai et al. | From Efficient Multimodal Models to World Models: A Survey | |
CN110245292B (en) | Natural language relation extraction method based on neural network noise filtering characteristics | |
CN113657272B (en) | A micro-video classification method and system based on missing data completion | |
Wang et al. | A Two-channel model for relation extraction using multiple trained word embeddings | |
Zhou et al. | What happens next? Combining enhanced multilevel script learning and dual fusion strategies for script event prediction | |
CN118606469A (en) | Multi-classification prediction method for intangible cultural heritage text based on multi-head attention and semantic features | |
CN118262533A (en) | Traffic flow prediction method based on self-adaptive dynamic fusion graph convolution network | |
CN118334588A (en) | A video crowd anomaly detection method based on attention fusion and residual structure | |
CN116050523A (en) | Attention-directed enhanced common sense reasoning framework based on mixed knowledge graph | |
CN117132885A (en) | A hyperspectral image classification method, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication ||
Application publication date: 20170707 |