CN110059584B - Event naming method combining boundary distribution and correction - Google Patents
Classifications
- G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- Y04S10/50 — Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention provides an event nomination (temporal event proposal) method combining boundary distribution and correction. An event nomination network is formed by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and the trained, updated network is used to predict nominations for video events. The start-point distribution network and the end-point distribution network are used for predicting event nominations; the boundary recurrent correction network is used for generating offset information for the predicted nominations and correcting their boundaries. By incorporating the distribution of event start and end points in real videos, the method generates event nominations that fit the true event distribution, and by correcting nomination boundaries with a recurrent correction network it obtains nominations that better match real events and have more accurate boundaries.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to an event nomination method combining boundary distribution and correction, where "event nomination" denotes generating temporal event proposals in video.
Background
With the rapid development of the internet and portable devices, shooting video has become convenient and easy, and large numbers of videos are uploaded to the internet, varying widely in content, duration and other respects. Most video-based computer vision algorithms, such as action recognition, operate on short clips obtained by trimming long videos to some extent. Trimming long videos, however, incurs heavy labor and time costs, so processing and analyzing untrimmed long videos has become necessary to meet real-world demands.
Dense video description [1] is a new task defined on untrimmed long videos, whose goal is to describe each of the multiple events occurring in a video in natural language. It can be divided into two parts: first, localizing the events in the video, i.e., finding the start and end times of all events, which amounts to extracting event nominations; and second, describing each localized event in natural language. Its potential applications are very broad, including early childhood education, daily assistance for the blind, movie subtitling, and video retrieval and classification.
Robust dense video description relies on high-quality event nominations: the generated nominations must not only cover the time spans of all possible events, but their boundaries must also coincide with those of the real events in the video. Compared with image-based tasks such as object detection, high-quality event nomination requires extracting not only object information from the video but also the temporal motion information of those objects, i.e., the relevant dynamic information. In real life, video capture conditions are unconstrained: multiple events in a video may overlap in time, and shooting angle, shooting distance and so on vary considerably, all of which makes event nomination very challenging.
Event nomination in video bears some similarity to object detection in images, and much current research on event nomination is inspired by object detection. Event nomination, however, must attend not only to the appearance features of the video but also to its temporal dynamics, and the time spans of events often vary greatly. Shou et al. [2] apply sliding windows of different scales to the video feature sequence and use [3] to extract features within each window for predicting event nominations. This method can only produce nominations of a limited set of preset lengths and cannot adapt flexibly to the actual length of an event; moreover, the repeated scanning of the same frames by windows of different scales introduces redundant computation. Gao et al. [4] further regress the boundaries of sliding-window-based nominations to obtain more flexible and accurate boundaries. Chao et al. [5] extend the Faster R-CNN [6] framework into a two-stream network for video and apply dilated convolution along the temporal axis to enlarge the receptive field, thereby obtaining nominations with larger time spans. Exploiting the memory of recurrent neural networks over video feature sequences, [7] feeds the video features within a sliding window into a recurrent neural network to predict the start/end times and confidences of nominations at different scales. Sliding windows, however, typically entail many repetitive operations; [8] avoids this repeated computation across scales by predicting nominations of several different scales at every node of a recurrent neural network. The boundaries predicted by this method are nevertheless fixed, their accuracy is limited by the feature-extraction stride, and they are difficult to fit accurately to real event boundaries.
Disclosure of Invention
The invention provides an event nomination method combining boundary distribution and correction, aiming to overcome the technical defects of existing recurrent-neural-network-based event nomination methods: fixed nomination boundaries whose accuracy is limited by the feature-extraction stride and which are difficult to fit to real event boundaries.
In order to solve the technical problems, the technical scheme of the invention is as follows:
An event nomination method combining boundary distribution and correction forms an event nomination network by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and performs nomination prediction on video events with the trained, updated network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating offset information for the predicted event nominations and correcting their boundaries.
The start-point distribution network and the end-point distribution network are constructed as follows:
normalize the video lengths of an existing dataset and determine the relative position of each event's start and end points within its video;
count the relative positions of all event start and end points in the dataset, and take the probability distributions w_s0 and w_e0 of event start and end points over the normalized video timeline, where w_s0 and w_e0 denote the start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network.
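The start/end-point statistics described above can be sketched in a few lines; this is a minimal illustration in plain Python, where the function name, bin count, and the `(start, end, duration)` input format are our own choices rather than anything specified by the patent:

```python
def start_end_distributions(events, num_bins=100):
    """Estimate the start/end-point probability distributions (w_s0, w_e0)
    over a normalized [0, 1] video timeline.

    `events` is a list of (start, end, video_duration) tuples; the function
    name, binning, and input format are illustrative assumptions.
    """
    w_s0 = [0.0] * num_bins
    w_e0 = [0.0] * num_bins
    for start, end, duration in events:
        # Relative positions after normalizing the video length
        s_bin = min(int(start / duration * num_bins), num_bins - 1)
        e_bin = min(int(end / duration * num_bins), num_bins - 1)
        w_s0[s_bin] += 1.0
        w_e0[e_bin] += 1.0
    n = float(len(events))
    # Convert counts to probability distributions over the timeline
    return [c / n for c in w_s0], [c / n for c in w_e0]
```

Each returned list sums to 1 and gives the empirical probability that an event starts (or ends) in each segment of the normalized timeline.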
The process by which the start-point and end-point distribution networks predict event nominations is as follows:
obtain the video features of a sample video through a three-dimensional convolutional network, and process them with the recurrent neural networks underlying the start-point and end-point distribution networks to obtain the features output at each time point of both networks;
at each time point, the start-point and end-point distribution networks each output K confidences, representing the likelihoods of K fixed-length event nominations whose intervals are:
[t − k, t + 1], k ∈ [0, K];
where t and k satisfy t ≥ k, and t varies with the video length; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of a corresponding nomination in the start-point and end-point distribution networks is taken as the final event confidence, completing the prediction of event nominations.
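The per-time-point candidate generation and confidence summing described above can be sketched as follows; a hedged illustration, with the function name and the list-based confidence inputs being our own assumptions:

```python
def candidate_nominations(t, K, conf_start, conf_end):
    """At time point t, form up to K fixed-length candidate event
    nominations [t - k, t + 1] (with k <= t) and score each by summing
    the confidences output by the start-point and end-point distribution
    networks.

    conf_start[k] / conf_end[k] are the kth confidences at time t; this
    is an illustrative sketch, not the patent's exact implementation.
    """
    nominations = []
    for k in range(min(K, t + 1)):           # values must satisfy t >= k
        interval = (t - k, t + 1)
        score = conf_start[k] + conf_end[k]  # combined event confidence
        nominations.append((interval, score))
    return nominations
```

At time t the candidates all end at t + 1 and reach back up to K steps, so longer videos (larger t) admit longer candidate intervals.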
The boundary recurrent correction network is built from a two-layer recurrent neural network: the first layer computes the features output at each time point from the video features of a sample video; the second layer generates offset information for the predicted event nominations and corrects their boundaries.
The process of generating offset information for predicted event nominations and correcting nomination boundaries is as follows:
compute the center-coordinate offset Δc and the scale change factor Δl of each nomination from the predicted nomination, with the specific formulas:
Δc = (G_c − P_c) / P_c;
Δl = log(G_l / P_l);
where G_c is the center coordinate of the real event nomination and P_c the center coordinate of the predicted nomination; G_l is the scale (length) of the real nomination and P_l the scale of the predicted nomination;
taking Δc and Δl as supervision signals, train the second-layer recurrent neural network with an L1-norm loss to obtain the predicted offset information of each nomination, denoted Δc' and Δl';
correct the predicted nomination boundary with the offsets Δc' and Δl' to obtain the corrected nomination center P_c' and scale P_l', specifically:
P_c' = P_c · (1 + Δc');
P_l' = P_l · exp(Δl');
from the corrected center P_c' and scale P_l', the corrected start and end times are:
P'_start = P_c' − P_l' / 2;
P'_end = P_c' + P_l' / 2;
where P'_start is the corrected nomination start time and P'_end the corrected nomination end time, completing the correction of the predicted nomination boundary.
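The target computation and its inverse can be written out as a short sketch; we assume here that the correction transform exactly mirrors the target parameterization (so that perfect predicted offsets recover the real boundary), which the formulas' definitions imply but the patent text does not spell out:

```python
import math

def correction_targets(G_c, G_l, P_c, P_l):
    """Supervision signals: center offset dc and log-scale change dl."""
    dc = (G_c - P_c) / P_c
    dl = math.log(G_l / P_l)
    return dc, dl

def apply_correction(P_c, P_l, dc, dl):
    """Apply predicted offsets (dc', dl') to a predicted nomination and
    return the corrected (start, end) times. Sketch only: assumed to be
    the inverse of correction_targets."""
    P_c2 = P_c * (1.0 + dc)        # corrected center P_c'
    P_l2 = P_l * math.exp(dl)      # corrected scale P_l'
    start = P_c2 - P_l2 / 2.0      # corrected start time P'_start
    end = P_c2 + P_l2 / 2.0        # corrected end time P'_end
    return start, end
```

With this pairing, feeding the exact targets back through `apply_correction` reproduces the ground-truth interval, which is a useful sanity check on the parameterization.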
The event nomination network loss function is the weighted superposition of a start-point distribution network loss subfunction, an end-point distribution network loss subfunction and a boundary recurrent correction network loss subfunction: the start-point distribution network loss subfunction is loss_s(c_s, t, X, y_s), the end-point distribution network loss subfunction is loss_e(c_e, t, X, y_e), and the boundary recurrent correction network loss subfunction is loss_reg(t_i);
where X denotes the entire dataset; y_t^k is the ground-truth supervision signal indicating whether the kth event nomination at the tth time point is a real event, and c_s, c_e are the nomination confidences under the start-point and end-point distribution networks;
K is the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the correction supervision signals for the kth event nomination at the t_i-th time point of the video; Δc'_k and Δl'_k are the offsets predicted for the same nomination at the same time point;
the overall loss function loss(c, t, X, y) is therefore:
loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);
where α, β and γ are the weight coefficients of the three sub-loss functions.
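The weighted superposition, together with the L1-norm regression loss mentioned above for the boundary targets, can be sketched as follows; the default weights are placeholders, since the patent does not state concrete values for α, β, γ:

```python
def l1_loss(pred, target):
    """Mean absolute error: the L1-norm regression loss used to train the
    boundary correction network (plain-Python sketch)."""
    assert len(pred) == len(target)
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(loss_s, loss_e, loss_reg, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted superposition of the three sub-losses:
    loss = alpha*loss_s + beta*loss_e + gamma*loss_reg.
    The weight defaults are illustrative placeholders."""
    return alpha * loss_s + beta * loss_e + gamma * loss_reg
```

In practice each sub-loss would be computed over a batch by the corresponding network head before being combined.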
All recurrent neural networks in the event nomination network are trained and updated with this loss function, completing the training and updating of the event nomination network and yielding the trained, updated network.
The specific process of nomination prediction for a video with the trained, updated event nomination network is:
S1: obtain the video features of the sample video through a three-dimensional convolutional network;
S2: process the video features with the trained recurrent neural networks to obtain the features output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
S3: at each time point, the start-point and end-point distribution networks each output several confidences; the sum of the confidences of a corresponding nomination in the two networks is taken as the final event confidence, completing the prediction of event nominations;
S4: the boundary recurrent correction network generates offset information for the predicted nominations;
S5: sort the event confidences in descending order, take the top 1000 nominations, and correct them with the corresponding offset information to obtain the final predicted event nominations.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
by incorporating the distribution of event start and end points in real videos, the proposed event nomination method combining boundary distribution and correction generates nominations that fit the true event distribution, and by correcting nomination boundaries with a recurrent correction network it obtains nominations that better match real events and have more accurate boundaries.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, an event nomination method combining boundary distribution and correction forms an event nomination network by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; trains and updates the event nomination network by constructing an event nomination network loss function; and performs nomination prediction on video events with the trained, updated network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating offset information for the predicted event nominations and correcting their boundaries.
More specifically, the start-point distribution network and the end-point distribution network are constructed as follows:
normalize the video lengths of an existing dataset and determine the relative position of each event's start and end points within its video;
count the relative positions of all event start and end points in the dataset, and take the probability distributions w_s0 and w_e0 of event start and end points over the normalized video timeline, where w_s0 and w_e0 denote the start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network.
In a specific implementation, when compiling the dataset statistics, event nominations whose temporal intersection-over-union (tIoU) with a real event exceeds a certain threshold σ are taken as positive samples, and the start/end-point distribution of all positive samples is counted.
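The tIoU computation and positive-sample selection can be sketched as follows; σ = 0.5 is a placeholder default, since the patent only says "a certain threshold":

```python
def tiou(a, b):
    """Temporal intersection-over-union of two intervals a = (start, end)
    and b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def positive_samples(nominations, real_events, sigma=0.5):
    """Keep nominations whose tIoU with at least one real event exceeds
    the threshold sigma (sigma=0.5 is an illustrative default)."""
    return [n for n in nominations
            if any(tiou(n, g) > sigma for g in real_events)]
```

Only these positive samples contribute to the empirical start/end-point distributions.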
More specifically, the process by which the start-point and end-point distribution networks predict event nominations is as follows:
obtain the video features of a sample video through a three-dimensional convolutional network, and process them with the recurrent neural networks underlying the start-point and end-point distribution networks to obtain the features output at each time point of both networks;
at each time point, the start-point and end-point distribution networks each output K confidences, representing the likelihoods of K fixed-length event nominations whose intervals are:
[t − k, t + 1], k ∈ [0, K];
where t and k satisfy t ≥ k, and t varies with the video length; the higher the confidence, the more likely the interval is an event nomination; the sum of the confidences of a corresponding nomination in the start-point and end-point distribution networks is taken as the final event confidence, completing the prediction of event nominations.
More specifically, the boundary recurrent correction network is built from a two-layer recurrent neural network: the first layer computes the features output at each time point from the video features of a sample video; the second layer generates offset information for the predicted event nominations and corrects their boundaries.
More specifically, the process of generating offset information for predicted event nominations and correcting nomination boundaries is as follows:
compute the center-coordinate offset Δc and the scale change factor Δl of each nomination from the predicted nomination, with the specific formulas:
Δc = (G_c − P_c) / P_c;
Δl = log(G_l / P_l);
where G_c is the center coordinate of the real event nomination and P_c the center coordinate of the predicted nomination; G_l is the scale (length) of the real nomination and P_l the scale of the predicted nomination;
taking Δc and Δl as supervision signals, train the second-layer recurrent neural network with an L1-norm loss to obtain the predicted offset information of each nomination, denoted Δc' and Δl';
correct the predicted nomination boundary with the offsets Δc' and Δl' to obtain the corrected nomination center P_c' and scale P_l', specifically:
P_c' = P_c · (1 + Δc');
P_l' = P_l · exp(Δl');
from the corrected center P_c' and scale P_l', the corrected start and end times are:
P'_start = P_c' − P_l' / 2;
P'_end = P_c' + P_l' / 2;
where P'_start is the corrected nomination start time and P'_end the corrected nomination end time, completing the correction of the predicted nomination boundary.
More specifically, the event nomination network loss function is the weighted superposition of a start-point distribution network loss subfunction, an end-point distribution network loss subfunction and a boundary recurrent correction network loss subfunction: the start-point distribution network loss subfunction is loss_s(c_s, t, X, y_s), the end-point distribution network loss subfunction is loss_e(c_e, t, X, y_e), and the boundary recurrent correction network loss subfunction is loss_reg(t_i);
where X denotes the entire dataset; y_t^k is the ground-truth supervision signal indicating whether the kth event nomination at the tth time point is a real event, and c_s, c_e are the nomination confidences under the start-point and end-point distribution networks;
K is the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the correction supervision signals for the kth event nomination at the t_i-th time point of the video; Δc'_k and Δl'_k are the offsets predicted for the same nomination at the same time point;
the overall loss function loss(c, t, X, y) is therefore:
loss(c, t, X, y) = α·loss_s(c_s, t, X, y_s) + β·loss_e(c_e, t, X, y_e) + γ·loss_reg(t_i);
where α, β and γ are the weight coefficients of the three sub-loss functions.
More specifically, all recurrent neural networks in the event nomination network are trained and updated with this loss function, completing the training and updating of the event nomination network and yielding the trained, updated network.
More specifically, the specific process of nomination prediction for a video with the trained, updated event nomination network is:
S1: obtain the video features of the sample video through a three-dimensional convolutional network;
S2: process the video features with the trained recurrent neural networks to obtain the features output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
S3: at each time point, the start-point and end-point distribution networks each output several confidences; the sum of the confidences of a corresponding nomination in the two networks is taken as the final event confidence, completing the prediction of event nominations;
S4: the boundary recurrent correction network generates offset information for the predicted nominations;
S5: sort the event confidences in descending order, take the top 1000 nominations, and correct them with the corresponding offset information to obtain the final predicted event nominations.
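Steps S5's ranking-and-correction stage can be sketched as follows; the data layout (center/length/confidence triples and an index-to-offsets mapping) is an illustrative assumption, not the patent's representation:

```python
import math

def predict_nominations(nominations, offsets, top_n=1000):
    """Sort candidate nominations by confidence (descending), keep the
    top_n, and correct each boundary with its predicted offsets.

    nominations: list of (center, length, confidence) triples;
    offsets: dict mapping nomination index -> (dc', dl').
    Returns a list of (start, end, confidence) tuples.
    """
    order = sorted(range(len(nominations)),
                   key=lambda i: nominations[i][2], reverse=True)[:top_n]
    results = []
    for i in order:
        c, l, conf = nominations[i]
        dc, dl = offsets[i]
        c2 = c * (1.0 + dc)          # corrected center
        l2 = l * math.exp(dl)        # corrected scale
        results.append((c2 - l2 / 2.0, c2 + l2 / 2.0, conf))
    return results
```

With zero offsets this simply converts each (center, length) pair into its (start, end) interval, ranked by confidence.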
Example 2
Building on Embodiment 1, training and verification are performed on ActivityNet, which contains 20,000 untrimmed videos totaling 849 hours, annotated with roughly 100,000 descriptive sentences. In this dataset each video contains multiple events and their descriptions, and the events within a single video differ in start/end times and durations. ActivityNet has three parts, a training set, a validation set and a test set, with 10024, 4926 and 5044 videos respectively; this embodiment conducts experiments mainly on the training and validation sets.
When features are extracted with the three-dimensional convolutional network [9], one video feature is extracted every 64 frames, and the feature dimension is compressed to 500 by principal component analysis. The recurrent neural network used is a long short-term memory (LSTM) network of dimension 512, and K in the model is set to 256. In the language model that generates sentences, each sentence is capped at 32 words, and words occurring fewer than 3 times are deleted when building the vocabulary. The event nomination network is first trained until stable and then trained jointly with the language model, with the learning rate set to 5e-5.
In the specific implementation process, two indexes are generally used to evaluate the quality of event nominations: recall and precision. Recall measures how many real events are covered by the predicted event nominations, and precision measures how many of the predicted event nominations are correct. In addition, there is a comprehensive index, the f1 score, which balances precision and recall and is computed from them as follows:

f1 = 2 × (precision × recall) / (precision + recall);
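The harmonic-mean formula above can be computed directly; this is a generic sketch, not code from the patent:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# f1 rewards balance: a perfect recall cannot compensate for poor precision
balanced = f1_score(0.6, 0.6)   # 0.6
skewed = f1_score(0.2, 1.0)     # 1/3
```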
the comparison of the event naming method combining boundary distribution and correction on the ActivityNet provided by the invention with the existing method is shown in the table 1:
TABLE 1
Method | Recall@1000 | Precision@1000 | f1 score@1000 |
---|---|---|---|
SST [13] | 0.716 | 0.533 | 0.571 |
Start-stop point modeling | 0.731 | 0.530 | 0.573 |
Boundary regression | 0.704 | 0.561 | 0.590 |
Start-stop distribution + boundary regression | 0.716 | 0.560 | 0.592 |
As shown in Table 1, @1000 denotes the event nominations with the top 1000 confidences. Comparing the method of the invention with the existing SST method [13], on which it is based and which likewise predicts event nominations with a recurrent neural network: SST does not exploit statistics such as the distribution of event start and stop points, and its predicted nomination boundaries are fixed. By counting the event start and stop points, the recall is clearly improved; after further regressing the nominated boundaries, the precision and f1 score are greatly improved while the recall remains essentially unchanged.
In the implementation process, when the event nominations are evaluated on the dense video description task, the main indexes are BLEU-1, BLEU-2, BLEU-3, BLEU-4, Meteor, Rouge-L and CIDEr-D, which measure the similarity between the description sentences generated by the event nomination network and the real description sentences. Among these indexes, Meteor correlates best with human judgment, so the performance of the event nomination network is mainly examined on that index.
As shown in Table 2, compared with existing methods on the dense video description task, the method of the invention performs better on most indexes, especially Meteor, which demonstrates the effectiveness of the event nomination network.
Table 2 Experimental results on the ActivityNet Captions validation set
It should be understood that the above examples of the present invention are provided by way of illustration only and are not intended to limit its embodiments. Other variations or modifications will be apparent to those of ordinary skill in the art in light of the above teachings; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention is intended to fall within the protection scope of the claims.
[1]R.Krishna,K.Hata,F.Ren,L.Fei-Fei,and J.C.Niebles,“Dense-captioning events in videos,”in Proc.IEEE International Conference on Computer Vision,2017,pp.706-715.
[2]Z.Shou,D.Wang,and S.Chang,“Temporal action localization in untrimmed videos via multi-stage CNNs,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2016,pp.1049-1058.
[3]S.Ji,W.Xu,M.Yang,and K.Yu,“3D convolutional neural networks for human action recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.35,no.1,pp.221-231,2013.
[4]J.Gao,Z.Yang,C.Sun,K.Chen,and R.Nevatia,“TURN TAP:Temporal unit regression network for temporal action proposals,”in Proc.IEEE International Conference on Computer Vision,2017,pp.3648-3656.
[5]Y.Chao,S.Vijayanarasimhan,B.Seybold,D.A.Ross,J.Deng,and R.Sukthankar,“Rethinking the faster R-CNN architecture for temporal action localization,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.1130-1139.
[6]S.Ren,K.He,R.Girshick,and J.Sun,“Faster R-CNN:Towards real-time object detection with region proposal networks,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.39,no.6,pp.1137-1149,2017.
[7]V.Escorcia,F.Caba,J.C.Niebles,and B.Ghanem,“Daps:Deep action proposals for action understanding,”in Proc.European Conference on Computer Vision,2016,pp.768–784.
[8]S.Buch,V.Escorcia,C.Shen,B.Ghanem,and J.C.Niebles,“SST:Single-stream temporal action proposals,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2017,pp.6373-6382.
[9]S.Ji,W.Xu,M.Yang,and K.Yu,“3D convolutional neural networks for human action recognition,”IEEE Transactions on Pattern Analysis and Machine Intelligence,vol.35,no.1,pp.221-231,2013.
[10]R.Krishna,K.Hata,F.Ren,L.Fei-Fei,and J.C.Niebles,“Dense-captioning events in videos,”2017 IEEE International Conference on Computer Vision(ICCV),pp.706–715,2017.
[11]Y.Li,T.Yao,Y.Pan,H.Chao,and T.Mei,“Jointly localizing and describing events for dense video captioning,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.7492-7500.
[12]J.Wang,W.Jiang,L.Ma,W.Liu,and Y.Xu,“Bidirectional attentive fusion with context gating for dense video captioning,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2018,pp.7190-7198.
[13]S.Buch,V.Escorcia,C.Shen,B.Ghanem,and J.C.Niebles,“SST:Single-stream temporal action proposals,”in Proc.IEEE Conference on Computer Vision and Pattern Recognition,2017,pp.6373-6382.
Claims (6)
1. An event naming method combining boundary distribution and correction, characterized in that: an event nomination network is formed by constructing a start-point distribution network, an end-point distribution network and a boundary recurrent correction network; the event nomination network is trained and updated by constructing an event nomination network loss function; and nomination prediction is performed on video events with the trained and updated event nomination network;
the start-point distribution network and the end-point distribution network are used for predicting event nominations;
the boundary recurrent correction network is used for generating bias information for the predicted event nominations and for correcting event nomination boundaries;
the construction process of the start-point distribution network and the end-point distribution network comprises the following steps:
normalizing the video lengths of an existing data set and determining the relative position of each event start and stop point within its video;
counting the relative positions of all event start and stop points in the videos of the data set to obtain the probability distributions w_s0 and w_e0 of all event start and stop points on the video timeline, where w_s0 and w_e0 denote the event start-point and end-point probability distributions respectively, yielding the start-point distribution network and the end-point distribution network;
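The boundary statistics above can be sketched as normalized histograms over a [0, 1] video timeline. The bin count and the toy annotation list are illustrative assumptions of this sketch:

```python
import numpy as np

def boundary_distribution(events, video_lengths, bins=100):
    """Estimate the probability distributions w_s0, w_e0 of event
    start and end points over a normalized [0, 1] video timeline."""
    starts, ends = [], []
    for (s, e), length in zip(events, video_lengths):
        starts.append(s / length)   # relative start position
        ends.append(e / length)     # relative end position
    w_s0, _ = np.histogram(starts, bins=bins, range=(0.0, 1.0))
    w_e0, _ = np.histogram(ends, bins=bins, range=(0.0, 1.0))
    return w_s0 / w_s0.sum(), w_e0 / w_e0.sum()

# toy annotations: (start, end) in seconds plus each video's total length
events = [(0.0, 10.0), (5.0, 20.0), (30.0, 40.0)]
lengths = [40.0, 40.0, 40.0]
w_s0, w_e0 = boundary_distribution(events, lengths, bins=4)
```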
the process of predicting event nominations with the start-point distribution network and the end-point distribution network is specifically:
acquiring video features of a sample video through a three-dimensional convolutional network, and processing the acquired features with a recurrent neural network in the start-point distribution network and the end-point distribution network to obtain the video feature output at each time point of the two networks;
outputting K confidences at each time point of the start-point distribution network and the end-point distribution network, where the confidences represent the likelihoods of K fixed-length event nominations whose time spans are:
[t-k, t+1], k ∈ [0, K];
where the values of t and k satisfy t ≥ k, and the value of t varies with the video length; a higher confidence means a greater likelihood of being an event nomination; the sum of the confidences of the corresponding event nomination in the start-point distribution network and the end-point distribution network is taken as the final event confidence, completing the prediction of event nominations.
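A minimal sketch of the K fixed-length nominations per time point and the confidence fusion described above; the (T, K) array shapes and names are assumptions, not the patent's implementation:

```python
import numpy as np

def fuse_confidences(start_conf, end_conf):
    """At each time point t the two networks each output K confidences;
    the fused event confidence is their sum, and nomination k at time t
    covers the span [t - k, t + 1] (valid only when t >= k).

    start_conf, end_conf: (T, K) confidence arrays."""
    T, K = start_conf.shape
    fused = start_conf + end_conf
    spans = [(t - k, t + 1) for t in range(T) for k in range(K) if t >= k]
    return fused, spans

rng = np.random.default_rng(1)
fused, spans = fuse_confidences(rng.random((8, 4)), rng.random((8, 4)))
```

Note that early time points admit fewer valid spans, since a nomination cannot start before time 0.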
2. The event naming method combining boundary distribution and correction according to claim 1, wherein the boundary recurrent correction network is constructed from two layers of recurrent neural networks: the first-layer recurrent neural network computes the video feature output at each time point from the video features of the sample video; the second-layer recurrent neural network generates the bias information of the predicted event nominations and corrects the event nomination boundaries.
3. The event naming method combining boundary distribution and correction according to claim 2, wherein generating the bias information of the predicted event nominations and correcting the event nomination boundaries specifically comprises:
calculating the center-coordinate offset Δc of an event nomination and its scale change factor Δl from the predicted event nomination, with the specific formulas:
Δc = (G_c - P_c) / P_c;
Δl = log(G_l / P_l);
wherein G_c denotes the actual event nomination center coordinate and P_c the predicted event nomination center coordinate; G_l denotes the actual event nomination scale and P_l the predicted event nomination scale;
taking the center-coordinate offset Δc and the scale change factor Δl of the event nomination as supervision signals, and training the second-layer recurrent neural network with an L1-norm loss on these signals to obtain the predicted bias information of the event nomination, denoted Δc' and Δl';
correcting the predicted event nomination boundary with the bias information Δc' and Δl' to obtain the corrected event nomination center position P'_c and scale P'_l, specifically:
P'_c = P_c × (1 + Δc');
P'_l = P_l × exp(Δl');
and obtaining the corrected start-stop times from the corrected center position P'_c and scale P'_l, specifically:
P'_start = P'_c - P'_l / 2;
P'_end = P'_c + P'_l / 2;
wherein P'_start denotes the corrected event nomination start time and P'_end the corrected event nomination end time, completing the correction of the predicted event nomination boundary.
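Inverting the offset parameterisation Δc = (G_c - P_c)/P_c and Δl = log(G_l/P_l) gives the corrected segment; this sketch again assumes (start, end) pairs:

```python
import math

def correct_nomination(pred, dc, dl):
    """Apply predicted bias information (dc, dl) to a predicted
    nomination (start, end), returning the corrected (start, end)."""
    p_c = (pred[0] + pred[1]) / 2.0
    p_l = pred[1] - pred[0]
    c = p_c * (1.0 + dc)            # P'_c = P_c * (1 + Δc')
    l = p_l * math.exp(dl)          # P'_l = P_l * exp(Δl')
    return c - l / 2.0, c + l / 2.0

# applying the exact offsets between two segments recovers the target
start, end = correct_nomination((12.0, 18.0), 0.0, math.log(10.0 / 6.0))
```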
4. The event naming method combining boundary distribution and correction according to claim 3, wherein the event nomination network loss function is formed by the weighted superposition of a start-point distribution network loss sub-function loss_s(c_s, t, X, y_s), an end-point distribution network loss sub-function loss_e(c_e, t, X, y_e) and a boundary recurrent correction network loss sub-function loss_reg(t_i);
the boundary recurrent correction network loss sub-function loss_reg(t_i), an L1-norm loss on the bias information, is:
loss_reg(t_i) = Σ_k ( |Δc_k - Δc'_k| + |Δl_k - Δl'_k| );
wherein X represents the entire data set; y_t^k indicates whether the k-th event nomination at time point t is a real event (the supervision signal), and c_s^k, c_e^k are the confidences of that event nomination under the start-point and end-point distribution networks;
K represents the number of event nominations output at each time point, equal to the number of confidences; Δc_k and Δl_k are the supervision signals for correcting the k-th event nomination at the t_i-th time point of the video, and Δc'_k and Δl'_k are the bias information predicted for the same event nomination at the same time point;
the loss function loss(c, t, X, y) is therefore specifically:
loss(c, t, X, y) = α*loss_s(c_s, t, X, y_s) + β*loss_e(c_e, t, X, y_e) + γ*loss_reg(t_i);
wherein α, β and γ are the weight coefficients of the three loss sub-functions.
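The weighted superposition can be sketched as below. The L1 regression term follows claim 3; binary cross-entropy is only an *assumed* form for the two distribution-network terms, since the patent text does not spell them out here:

```python
import numpy as np

def bce(conf, labels, eps=1e-7):
    """Assumed binary cross-entropy over per-time-point confidences."""
    conf = np.clip(conf, eps, 1.0 - eps)
    return -np.mean(labels * np.log(conf) + (1 - labels) * np.log(1 - conf))

def l1_reg(dc, dl, dc_pred, dl_pred):
    """L1 loss between offset supervision signals and predicted bias."""
    return np.sum(np.abs(dc - dc_pred) + np.abs(dl - dl_pred))

def total_loss(cs, ys, ce, ye, reg_args, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted superposition of the three loss sub-functions."""
    return alpha * bce(cs, ys) + beta * bce(ce, ye) + gamma * l1_reg(*reg_args)
```

In practice α, β and γ would be tuned so that no single term dominates training.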
5. The event naming method combining boundary distribution and correction according to claim 4, wherein all the recurrent neural networks in the event nomination network are trained and updated with the loss function, completing the training update of the event nomination network and obtaining the trained and updated event nomination network.
6. The event naming method combining boundary distribution and correction according to claim 5, wherein the specific process of predicting video events with the trained and updated event nomination network is:
s1: acquiring the video features of a sample video through a three-dimensional convolutional network;
s2: processing the video features with the trained and updated recurrent neural networks to obtain the video feature output at each time point of the start-point distribution network, the end-point distribution network and the boundary recurrent correction network;
s3: the start-point distribution network and the end-point distribution network each output a set of confidences at every time point, and the sum of the two networks' confidences for the corresponding event nomination is taken as the final event confidence, completing the prediction of event nominations;
s4: the boundary recurrent correction network generates the bias information of the predicted event nominations;
s5: the event confidences are sorted in descending order, the top 1000 event nominations are taken, and each is corrected according to its corresponding bias information to obtain the final predicted event nominations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910245568.2A CN110059584B (en) | 2019-03-28 | 2019-03-28 | Event naming method combining boundary distribution and correction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110059584A CN110059584A (en) | 2019-07-26 |
CN110059584B true CN110059584B (en) | 2023-06-02 |
Family
ID=67317857
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910245568.2A Active CN110059584B (en) | 2019-03-28 | 2019-03-28 | Event naming method combining boundary distribution and correction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110059584B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114445757A (en) * | 2022-02-24 | 2022-05-06 | 腾讯科技(深圳)有限公司 | Nomination obtaining method, network training method, device, storage medium and equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9881380B2 (en) * | 2016-02-16 | 2018-01-30 | Disney Enterprises, Inc. | Methods and systems of performing video object segmentation |
WO2018125580A1 (en) * | 2016-12-30 | 2018-07-05 | Konica Minolta Laboratory U.S.A., Inc. | Gland segmentation with deeply-supervised multi-level deconvolution networks |
CN109101859A (en) * | 2017-06-21 | 2018-12-28 | 北京大学深圳研究生院 | The method for punishing pedestrian in detection image using Gauss |
CN108875624B (en) * | 2018-06-13 | 2022-03-25 | 华南理工大学 | Face detection method based on multi-scale cascade dense connection neural network |
CN108805083B (en) * | 2018-06-13 | 2022-03-01 | 中国科学技术大学 | Single-stage video behavior detection method |
CN109271876B (en) * | 2018-08-24 | 2021-10-15 | 南京理工大学 | Video motion detection method based on time evolution modeling and multi-example learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||