CN108764148A - Multi-region real-time action detection method based on surveillance video
- Publication number: CN108764148A (application CN201810534453.0A)
- Authority: CN (China)
- Prior art keywords: tube, action, detection, frame, yolo
- Prior art date: 2018-05-30
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G - PHYSICS; G06 - COMPUTING, CALCULATING OR COUNTING; G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data; G06V40/20 - Movements or behaviour, e.g. gesture recognition
- G - PHYSICS; G06 - COMPUTING, CALCULATING OR COUNTING; G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T7/00 - Image analysis; G06T7/20 - Analysis of motion; G06T7/269 - Analysis of motion using gradient-based methods
Abstract
Description
Technical Field
The invention belongs to the field of computer vision, and in particular relates to a human action detection system for surveillance video scenes.
Background Art
As surveillance equipment becomes more widespread, an increasing number of surveillance-based technologies are being deployed. Action recognition is one of the most valuable of these technologies; it is mainly used for human-machine interaction in indoor and factory environments, and for detecting and recognizing specific dangerous actions in public-safety settings.
Most action recognition methods for surveillance video concentrate on recognizing and classifying the action of the whole scene. The videos used are generally manually trimmed clips that contain only one action class, and they differ greatly from natural, untrimmed footage. Other researchers have focused on detecting where an action starts and ends along the time axis. In practical applications, however, it is useful to obtain both the start and end of an action in the video and the spatial region in which it occurs. Moreover, although existing action detection methods achieve good results on public datasets and in competitions, they generally divide the whole video into many small clips, or process the entire video at once, before outputting the spatio-temporal location of the action. Real-time action detection requires frame-level processing, so such methods cannot be deployed in a surveillance system.
With the popularization of surveillance equipment, detecting human actions in surveillance video has gradually become a popular research topic. Wang L., Qiao Y. and Tang X., in "Action recognition with trajectory-pooled deep convolutional descriptors" (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015), recognize actions over a whole video by combining features extracted with deep neural networks and features obtained from a dense trajectory algorithm. D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, in "Learning spatiotemporal features with 3D convolutional networks" (IEEE International Conference on Computer Vision (ICCV), 2015), propose the C3D framework, which uses 3D convolution and 3D pooling to extract human action features from video. Simonyan K. and Zisserman A., in "Two-Stream Convolutional Networks for Action Recognition in Videos" (Advances in Neural Information Processing Systems, 2014), extract optical flow sequences from the RGB image sequence, train a convolutional neural network on each stream, and fuse the features of the two networks to recognize actions. Although these models achieve good results, they can only classify the whole video and cannot localize an action in space and time.
G. Gkioxari and J. Malik, in "Finding action tubes" (IEEE Conference on Computer Vision and Pattern Recognition, 2015), detect action proposals in every frame and then link the per-frame proposals into action sequences. J. Lu, R. Xu and J. J. Corso, in "Human action segmentation with hierarchical supervoxel consistency" (IEEE Conference on Computer Vision and Pattern Recognition, June 2015), propose a hierarchical MRF model that links low-level video segments with high-level human motion and appearance cues to segment actions in video. These methods mainly achieve spatial segmentation of the actions in a video, and they rely on large numbers of frame-level region proposals, which require heavy computation.
Yuan J., Ni B. and Yang X., in "Temporal Action Localization with Pyramid of Score Distribution Features" (IEEE Conference on Computer Vision and Pattern Recognition, 2016), extract a Pyramid of Score Distribution Features (PSDF) from the video based on iDT features, process the PSDF sequence with an LSTM network, and predict action segments from the output frame-level class confidence scores. Shou Z., Wang D. and Chang S. F., in "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs" (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016), first use sliding windows to generate video segments of multiple sizes, process them with a multi-stage network (Segment-CNN), and finally apply non-maximum suppression to remove overlapping segments and complete the prediction. Shou Z., Chan J. and Zareian A., in "CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos" (IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017), design a convolutional-de-convolutional network (CDC) based on C3D (a 3D CNN) that takes a short video clip as input and outputs frame-level action class probabilities; the network is mainly used to fine-tune action boundaries in temporal action detection so that they become more accurate. Although these frameworks can run in real time, they only localize actions precisely in the temporal dimension and cannot provide spatio-temporal detection.
J. C. van Gemert, M. Jain, E. Gati and C. G. Snoek, in "APT: Action localization proposals from dense trajectories" (BMVC, volume 2, page 4, 2015), use unsupervised clustering to generate a set of bounding-box-style spatio-temporal action proposals. Because the method is based on dense trajectory features, it cannot detect actions characterized by small motion. P. Weinzaepfel, Z. Harchaoui and C. Schmid, in "Learning to track for spatio-temporal action localization" (IEEE International Conference on Computer Vision, 2015), perform spatio-temporal action detection by combining frame-level EdgeBoxes region proposals with a tracking-by-detection framework. However, the temporal extent of the action is still detected with a multi-scale sliding window over each track, which makes the method inefficient for long video sequences.
Summary of the Invention
In view of the problems of existing action detection methods, the present invention proposes a multi-region real-time action detection method based on surveillance video. The technical solution adopted by the present invention is as follows:
A multi-region real-time action detection method based on surveillance video, characterized by the following steps:
Model training phase:
A1. Obtain training data: a database of annotated specific actions.
A2. Compute the dense optical flow of the video sequences in the training data to obtain their optical flow sequences, and annotate the optical flow images in the optical flow sequences.
A3. Train the object detection model YOLO v3 separately on the video sequences and on the optical flow sequences of the training data, obtaining an RGB YOLO v3 model and an optical flow YOLO v3 model respectively.
Testing phase:
B1. Extract the sparse optical flow image sequence of the video with the pyramidal Lucas-Kanade optical flow method, then feed the RGB image sequence and the sparse optical flow image sequence of the video into the RGB YOLO v3 model and the optical flow YOLO v3 model respectively. From the detection boxes output by the RGB YOLO v3 model, apply non-maximum suppression to keep the top n detection boxes b_i^rgb, i = 1…n, over all action classes; each box carries an action class label and a probability score s_i^rgb of belonging to that action. Likewise, from the detection boxes output by the optical flow YOLO v3 model, apply non-maximum suppression to keep the top n detection boxes b_k^of, k = 1…n, over all action classes; each box carries an action class label and a probability score s_k^of. Then traverse the detection boxes of the two models: for each box b_i^rgb output by the RGB YOLO v3 model, compute its intersection-over-union with every box of the same action class output by the optical flow YOLO v3 model, and denote by b_*^of the same-class flow box with the largest intersection-over-union. If this largest intersection-over-union is greater than a threshold K, the probability scores of the two corresponding boxes are fused into a single score that is taken as the confidence of the box b_i^rgb output by the RGB YOLO v3 model, according to the fusion formula,
where IOU(b_i^rgb, b_*^of) denotes the intersection-over-union of b_i^rgb and b_*^of, and s_*^of is the probability score of the same-class flow box b_*^of with the largest intersection-over-union with b_i^rgb.
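As a concrete illustration of the first part of step B1, the following is a minimal sketch of test-time sparse optical flow extraction with OpenCV's pyramidal Lucas-Kanade tracker; the function name, the feature-point parameters and the way the tracked motion is rendered into a flow image are illustrative assumptions, since the patent only names the pyramidal Lucas-Kanade method.

import cv2
import numpy as np

def sparse_flow_image(prev_gray, curr_gray):
    # Corner points to track, picked in the previous frame (parameter values are
    # illustrative choices, not values fixed by the patent).
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=5)
    flow_img = np.zeros_like(prev_gray)
    if pts is None:
        return flow_img
    # Pyramidal Lucas-Kanade: track the points into the current frame.
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None,
                                                 winSize=(15, 15), maxLevel=3)
    # Render the tracked motion as a sparse flow image for the flow YOLO v3 model
    # (one plausible rendering; the patent does not specify the encoding).
    for p0, p1, ok in zip(pts.reshape(-1, 2), nxt.reshape(-1, 2), status.reshape(-1)):
        if ok:
            cv2.line(flow_img, (int(p0[0]), int(p0[1])), (int(p1[0]), int(p1[1])), 255, 2)
    return flow_img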
B2. Using the fused confidence score of each action class for each detection box output by the RGB YOLO v3 model, link the detection boxes across the RGB image sequence of the video to form tubes:
Initialize the tubes with the detection boxes of the first frame of the video's RGB image sequence; for example, if the first frame produces n detection boxes, n tubes are initialized, and the number of tubes of a given action class in the first frame is
n_class(1) = n.
Perform the following operations for each action class separately:
S1. Match each tube with the detection boxes produced in frame t. First traverse the tubes belonging to the same action class; if that class has n tubes, compute for each tube the average of its per-frame confidences as the tube's value, and sort the n tubes of that class by this value in descending order to form the list list_class. To determine the action class of each tube, a list I = {l_{t-k+1} … l_t} is defined, which stores the action classes of the last k frames of the tube.
S2. Traverse the list list_class and the detection boxes b_i^rgb, i = 1…n, of frame t, and select the boxes that satisfy the following conditions to add to a tube (a simplified code sketch of this matching loop follows step S3 below):
Traverse the tubes in list_class and, for each tube, select the boxes of frame t that have the same action class as the tube for matching. If the intersection-over-union between such a box and the detection box in the last frame of the tube is greater than a threshold d, add the box to the queue H_list_class.
If H_list_class is not empty, pick the box with the highest confidence in H_list_class, add it to the tube, and exclude that box when the boxes b_i^rgb, i = 1…n, of frame t are traversed again.
If H_list_class is empty, the tube does not take any new box and remains unchanged; if no new box is added to a tube for k consecutive frames, the tube is terminated.
If frame t still has an unmatched box, denoted b_u, traverse all tubes, compute the intersection-over-union between b_u and the last frame of every tube, and select the tube whose intersection-over-union with b_u is greater than the threshold k and is the largest; denote it T* and add b_u to that tube. T* satisfies
T* = argmax_{T_i} IOU(b_u, T_i(t-1)),
where T_i is the i-th tube and T_i(t-1) is frame t-1 of the i-th tube; if IOU(b_u, T*(t-1)) > k, b_u is appended to T*, otherwise b_u remains unmatched.
If the t-th frame still contains an unmatched detection box after this, a new tube is generated starting from that box, and the box is used as the first frame of that tube to initialize it.
S3. After all tubes have been matched, update the list of action classes of the last k frames of each tube, I = {l_{t-k+1} … l_t}, where l_t is the action class of frame t of the tube, and update the action class L of each tube: count the action classes in I = {l_{t-k+1} … l_t} and take the most frequent class as the tube's action class L, which satisfies
L = argmax_c Σ_{i=t-k+1}^{t} g(l_i, c),
where g(l_i, c) = 1 if l_i = c and g(l_i, c) = 0 if l_i ≠ c, with c a given action class; that is, the action class that occurs most often in I = {l_{t-k+1} … l_t} is the action class of the tube.
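The following is a simplified sketch of the per-frame matching in steps S1 and S2 for a single action class, assuming a tube is represented as a dict {"boxes": [...], "scores": [...], "gap": int} and a detection as {"box": (x1, y1, x2, y2), "score": float}; all names, the data layout and the helper iou() are illustrative assumptions rather than structures prescribed by the patent.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def match_frame(tubes, detections, d=0.3, k_gap=5):
    # S1: visit the tubes of this class in descending order of mean confidence.
    tubes.sort(key=lambda T: sum(T["scores"]) / len(T["scores"]), reverse=True)
    unmatched = list(detections)
    for T in tubes:
        # S2: candidate boxes whose overlap with the tube's last box exceeds d.
        cands = [b for b in unmatched if iou(b["box"], T["boxes"][-1]) > d]
        if cands:
            best = max(cands, key=lambda b: b["score"])  # highest confidence joins
            T["boxes"].append(best["box"])
            T["scores"].append(best["score"])
            T["gap"] = 0
            unmatched.remove(best)                       # each box joins one tube
        else:
            T["gap"] += 1                                # tube kept unchanged
    live = [T for T in tubes if T["gap"] < k_gap]        # terminate stale tubes
    # Remaining detections either extend the best-overlapping live tube or start
    # new tubes, as described in the text above; omitted here for brevity.
    return live, unmatched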
In step A1, the annotated database of specific actions is the Action Detection dataset of UCF101.
In step A2, the dense optical flow of the video sequences in the training data is computed with the calcOpticalFlowFarneback function of the OpenCV library.
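A minimal sketch of this training-time dense optical flow step with OpenCV follows; the Farneback parameter values and the HSV-style rendering of the flow field into an image are common illustrative choices, not values stated in the patent.

import cv2
import numpy as np

def dense_flow_image(prev_bgr, curr_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    # Per-pixel (dx, dy) displacement field between the two frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = ang * 90 / np.pi                                    # direction as hue
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # magnitude as value
    # The resulting flow image can then be annotated and used to train the
    # optical flow YOLO v3 model.
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)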
Compared with the prior art, the present invention not only detects the spatio-temporal position of specific actions in surveillance video, but also processes the surveillance stream in real time.
For the above reasons, the present invention can be widely applied in fields such as computer vision.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the intersection-over-union computation in an embodiment of the present invention.
Fig. 2 is an overall schematic diagram of the multi-region real-time action detection method based on surveillance video in an embodiment of the present invention.
Fig. 3 is a program flow chart of the multi-region real-time action detection method based on surveillance video in an embodiment of the present invention.
Fig. 4 is a schematic diagram of the processing of a single frame in an embodiment of the present invention.
Fig. 5 is a schematic diagram of the processing of a continuous image sequence in an embodiment of the present invention.
Detailed Description of the Embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Figs. 1-5, a multi-region real-time action detection method based on surveillance video comprises the following steps:
Model training phase:
A1. Obtain training data: a database of annotated specific actions.
A2. Compute the dense optical flow of the video sequences in the training data to obtain their optical flow sequences, and annotate the optical flow images in the optical flow sequences.
A3. Train the object detection model YOLO v3 separately on the video sequences and on the optical flow sequences of the training data, obtaining an RGB YOLO v3 model and an optical flow YOLO v3 model respectively.
Testing phase:
B1. Extract the sparse optical flow image sequence of the video with the pyramidal Lucas-Kanade optical flow method, then feed the RGB image sequence and the sparse optical flow image sequence of the video into the RGB YOLO v3 model and the optical flow YOLO v3 model respectively. From the detection boxes output by the RGB YOLO v3 model, apply non-maximum suppression to keep the top n detection boxes b_i^rgb, i = 1…n, over all action classes, each with an action class label and a probability score s_i^rgb; likewise, from the detection boxes output by the optical flow YOLO v3 model, apply non-maximum suppression to keep the top n detection boxes b_k^of, k = 1…n, over all action classes, each with an action class label and a probability score s_k^of. Traverse the detection boxes of the two models: for each box b_i^rgb output by the RGB YOLO v3 model, compute its intersection-over-union with every box of the same action class output by the optical flow YOLO v3 model, and denote by b_*^of the same-class flow box with the largest intersection-over-union. If this largest intersection-over-union is greater than a threshold K, the probability scores of the two corresponding boxes are fused into a single score that is taken as the confidence of the box b_i^rgb, according to the fusion formula,
where IOU(b_i^rgb, b_*^of) denotes the intersection-over-union of b_i^rgb and b_*^of, and s_*^of is the probability score of the same-class flow box b_*^of with the largest intersection-over-union with b_i^rgb. For example, the intersection-over-union IOU(A, B) of two images A and B, illustrated in Fig. 1, is
IOU(A, B) = area(A ∩ B) / area(A ∪ B),
where area(A) is the area of image A and area(A ∩ B) is the area of the intersection of the two images.
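Following this definition, the detection fusion of step B1 can be sketched as below; each detection is assumed to be a dict {"box": (x1, y1, x2, y2), "label": str, "score": float}, and since the exact fusion expression is not reproduced in this text, the simple sum of the two scores used here is only an assumed stand-in for that fusion rule.

def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def fuse(rgb_dets, flow_dets, K=0.5):
    fused = []
    for r in rgb_dets:
        # Same-class flow box with the largest overlap with this RGB box.
        same = [f for f in flow_dets if f["label"] == r["label"]]
        best = max(same, key=lambda f: iou(r["box"], f["box"]), default=None)
        score = r["score"]
        if best is not None and iou(r["box"], best["box"]) > K:
            score = r["score"] + best["score"]   # ASSUMED fusion rule (sum of scores)
        fused.append({"box": r["box"], "label": r["label"], "score": score})
    return fused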
B2. Using the fused confidence score of each action class for each detection box output by the RGB YOLO v3 model, link the detection boxes across the RGB image sequence of the video to form tubes:
Initialize the tubes with the detection boxes of the first frame of the video's RGB image sequence; for example, if the first frame produces n detection boxes, n tubes are initialized, and the number of tubes of a given action class in the first frame is
n_class(1) = n.
Perform the following operations for each action class separately:
S1. Match each tube with the detection boxes produced in frame t. First traverse the tubes belonging to the same action class; if that class has n tubes, compute for each tube the average of its per-frame confidences as the tube's value, and sort the n tubes of that class by this value in descending order to form the list list_class. To determine the action class of each tube, a list I = {l_{t-k+1} … l_t} is defined, which stores the action classes of the last k frames of the tube.
S2. Traverse the list list_class and the detection boxes b_i^rgb, i = 1…n, of frame t, and select the boxes that satisfy the following conditions to add to a tube:
Traverse the tubes in list_class and, for each tube, select the boxes of frame t that have the same action class as the tube for matching. If the intersection-over-union between such a box and the detection box in the last frame of the tube is greater than a threshold d, add the box to the queue H_list_class.
If H_list_class is not empty, pick the box with the highest confidence in H_list_class, add it to the tube, and exclude that box when the boxes b_i^rgb, i = 1…n, of frame t are traversed again.
If H_list_class is empty, the tube does not take any new box and remains unchanged; if no new box is added to a tube for k consecutive frames, the tube is terminated.
If frame t still has an unmatched box, denoted b_u, traverse all tubes, compute the intersection-over-union between b_u and the last frame of every tube, and select the tube whose intersection-over-union with b_u is greater than the threshold k and is the largest; denote it T* and add b_u to that tube. T* satisfies
T* = argmax_{T_i} IOU(b_u, T_i(t-1)),
where T_i is the i-th tube and T_i(t-1) is frame t-1 of the i-th tube; if IOU(b_u, T*(t-1)) > k, b_u is appended to T*, otherwise b_u remains unmatched.
If the t-th frame still contains an unmatched detection box after this, a new tube is generated starting from that box, and the box is used as the first frame of that tube to initialize it.
S3. After all tubes have been matched, update the list of action classes of the last k frames of each tube, I = {l_{t-k+1} … l_t}, where l_t is the action class of frame t of the tube, and update the action class L of each tube: count the action classes in I = {l_{t-k+1} … l_t} and take the most frequent class as the tube's action class L, which satisfies
L = argmax_c Σ_{i=t-k+1}^{t} g(l_i, c),
where g(l_i, c) = 1 if l_i = c and g(l_i, c) = 0 if l_i ≠ c, with c a given action class; that is, the action class that occurs most often in I = {l_{t-k+1} … l_t} is the action class of the tube.
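A small sketch of the label update in step S3 is given below, assuming each tube keeps the per-frame action labels of its last k frames; the most frequent label in that window becomes the tube's class L. The function and variable names are illustrative only.

from collections import Counter, deque

def update_tube_label(recent_labels, new_label, k=10):
    # recent_labels holds the labels l_{t-k+1} ... l_t of this tube's last frames.
    recent_labels.append(new_label)
    while len(recent_labels) > k:
        recent_labels.popleft()
    # L = argmax_c sum_i g(l_i, c): the most frequent label in the window.
    return Counter(recent_labels).most_common(1)[0][0]

# Example: a tube whose recent frames were mostly labelled "fencing".
labels = deque(["fencing", "fencing", "walking", "fencing"])
print(update_tube_label(labels, "fencing", k=5))  # -> fencing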
In Fig. 2, (a) shows the RGB image sequence of the video; (b) shows the optical flow algorithm: in the testing phase the pyramidal Lucas-Kanade optical flow method in OpenCV is used to extract sparse optical flow images, while in the training phase dense optical flow images are extracted; (c) is the resulting sparse optical flow image; (d) is the action detection model, one being the RGB YOLO v3 model trained on the RGB image sequence of the video and the other the optical flow YOLO v3 model trained on the optical flow sequence; (e) shows the detection results output by the RGB YOLO v3 model; (f) shows the detection results of the optical flow YOLO v3 model; (g) shows the fusion of the outputs of the two models, which yields more robust features; (h) shows how the fused features are used to link the detection boxes across the RGB image sequence of the video into tubes.
In Fig. 4, (a) is an image from the RGB image sequence of the video; (b) is the optical flow image corresponding to that image; (c) is the detection result output after the RGB image is processed by the RGB YOLO v3 model; (d) is the detection result output after the optical flow image is processed by the optical flow YOLO v3 model.
Fig. 5 shows a continuous image sequence from the video: (a) shows images sampled at equal intervals from the RGB image sequence of the video; (b) shows the optical flow sequence corresponding to those images; (c) shows the detection results output after the RGB images are processed by the RGB YOLO v3 model; (d) shows the detection results output after the optical flow sequence is processed by the optical flow YOLO v3 model; (e) shows the tubes obtained by fusing the detection results of (c) and (d).
In step A1, the annotated database of specific actions is the Action Detection dataset of UCF101.
In step A2, the dense optical flow of the video sequences in the training data is computed with the calcOpticalFlowFarneback function of the OpenCV library.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (3)
Priority Applications (1)
- CN201810534453.0A (granted as CN108764148B), priority date 2018-05-30, filing date 2018-05-30: Multi-region real-time action detection method based on monitoring video
Publications (2)
- CN108764148A, published 2018-11-06
- CN108764148B, granted 2020-03-10
Family ID: 64003645
Family Applications (1)
- CN201810534453.0A, filed 2018-05-30, status: Expired - Fee Related
Country Status (1)
- CN: CN108764148B
Patent Citations (4)
- US2014/0254882A1 (Adobe Systems Incorporated), priority 2013-03-11, published 2014-09-11: Optical Flow with Nearest Neighbor Field Fusion
- CN105512618A, priority 2015-11-27, published 2016-04-20: Video tracking method
- CN106709461A, priority 2016-12-28, published 2017-05-24: Video based behavior recognition method and device
- CN107316007A, priority 2017-06-07, published 2017-11-03: Monitoring image multi-class object detection and recognition method based on deep learning
Non-Patent Citations (4)
- Alaaeldin El-Nouby et al.: "Real-Time End-to-End Action Detection with Two-Stream Networks", arXiv
- Christoph Feichtenhofer et al.: "Detect to Track and Track to Detect", arXiv
- Philippe Weinzaepfel et al.: "Learning to track for spatio-temporal action localization", 2015 IEEE International Conference on Computer Vision
- Huang Tiejun et al.: "Multimedia technology research 2013: visual perception and processing for intelligent video surveillance", Journal of Image and Graphics
Cited By (13)
- CN109447014A (Southeast University - Wuxi Institute of Integrated Circuit Technology), priority 2018-11-07, published 2019-03-08: Online video behavior detection method based on a two-channel convolutional neural network
- CN111291779A, priority 2018-12-07, published 2020-06-16: Vehicle information identification method and system, memory and processor
- CN111291779B, priority 2018-12-07, granted 2024-09-13: Vehicle information identification method, system, memory and processor
- WO2020114120A1 (Shenzhen Kuang-Chi Space Technology Co., Ltd.), priority 2018-12-07, published 2020-06-11: Method for identifying vehicle information, system, memory device, and processor
- CN109740454A (Guizhou University), priority 2018-12-19, published 2019-05-10: A body posture recognition method based on YOLO-V3
- CN109711344A (Northeastern University), priority 2018-12-27, published 2019-05-03: A front-end intelligent detection method for specific abnormal behavior
- CN109886165A (Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences), priority 2019-01-23, published 2019-06-14: An action video extraction and classification method based on moving object detection
- CN111126153A (Beijing Ruian Technology Co., Ltd.), priority 2019-11-25, published 2020-05-08: Safety monitoring method, system, server and storage medium based on deep learning
- CN111353452A (State Grid Hunan Electric Power Co., Ltd.), priority 2020-03-06, published 2020-06-30: A method, device, medium and equipment for behavior recognition based on RGB images
- CN114158281A (Rakuten Group, Inc.), priority 2020-07-07, published 2022-03-08: Area extraction device, area extraction method and area extraction program
- CN114049396A (Beijing Baidu Netcom Science and Technology Co., Ltd.), priority 2021-11-05, published 2022-02-15: Method and device for labeling training images and tracking targets, electronic equipment and medium
- CN114694246A (Suzhou Jiuhe Intelligent Technology Co., Ltd.), priority 2022-02-15, published 2022-07-01: Human body action detection system applied to public outdoor scenes
- CN115205787A (iSoftStone Information Technology (Group) Co., Ltd.), priority 2022-07-26, published 2022-10-18: Pedestrian fall detection method and device, electronic equipment and storage medium
Also Published As
- CN108764148B, published 2020-03-10
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant
- CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 2020-03-10)