WO2022007193A1 - Weakly supervised video behavior detection method and system based on iterative learning - Google Patents

Weakly supervised video behavior detection method and system based on iterative learning

Info

Publication number
WO2022007193A1
WO2022007193A1 (PCT/CN2020/115542)
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
network model
video
output
activation sequence
Prior art date
Application number
PCT/CN2020/115542
Other languages
English (en)
French (fr)
Inventor
宋砚
邹荣
舒祥波
Original Assignee
南京理工大学
Priority date
Filing date
Publication date
Application filed by 南京理工大学 filed Critical 南京理工大学
Priority to US17/425,653 (now US11721130B2)
Publication of WO2022007193A1

Classifications

    • G — PHYSICS
      • G06 — COMPUTING; CALCULATING OR COUNTING
        • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/20 — Movements or behaviour, e.g. gesture recognition
              • G06V 40/23 — Recognition of whole body movements, e.g. for sport training
          • G06V 20/00 — Scenes; Scene-specific elements
            • G06V 20/40 — Scenes; Scene-specific elements in video content
              • G06V 20/44 — Event detection
          • G06V 10/00 — Arrangements for image or video recognition or understanding
            • G06V 10/40 — Extraction of image or video features
              • G06V 10/62 — Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V 10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/74 — Image or video pattern matching; Proximity measures in feature spaces
                • G06V 10/761 — Proximity, similarity or dissimilarity measures
              • G06V 10/82 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
        • G06F — ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 — Pattern recognition
            • G06F 18/20 — Analysing
              • G06F 18/24 — Classification techniques
              • G06F 18/25 — Fusion techniques
                • G06F 18/253 — Fusion techniques of extracted features
        • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 — Computing arrangements based on biological models
            • G06N 3/02 — Neural networks
              • G06N 3/04 — Architecture, e.g. interconnection topology
                • G06N 3/045 — Combinations of networks
              • G06N 3/08 — Learning methods

Definitions

  • The invention relates to the technical field of behavior detection, and in particular to a weakly supervised video behavior detection method and system based on iterative learning.
  • In recent years, behavior recognition has been widely studied in the field of computer vision. Its purpose is to automatically analyze collected video and identify the categories of the actions it contains, replacing the human eye in analyzing and judging actions.
  • Video behavior recognition is widely used in various video scenarios, such as intelligent surveillance, human-computer interaction, motion analysis, and virtual reality. Behavior detection developed from behavior recognition: behavior recognition mainly targets segmented action videos, while behavior detection mainly targets unsegmented action videos and is thus better suited to long videos shot in real life. The goal of behavior detection is to find the start time and end time of each action in a long unsegmented video and to identify the category of each action.
  • Weakly supervised temporal behavior detection can, knowing only which actions a video contains, locate the frame at which each action in the video starts and the frame at which it ends, and identify the categories of these actions.
  • Compared with plain behavior recognition and fully supervised temporal behavior detection, weakly supervised temporal behavior detection therefore has broader application prospects and practical value in reality.
  • At present, most weakly supervised temporal behavior detection methods extract video features with a deep convolutional neural network, use multi-instance learning or an attention mechanism to find the scores in the class activation sequence that respond strongly to actions in order to classify the video's actions, build a localization network from action structure, action features, or the relationship between actions and background to learn and update the class activation sequence, and finally perform localization according to the class activation sequence. These methods do not mine the localization and semantic information latently contained in the class activation sequence, so localization accuracy remains low.
  • the purpose of the present invention is to provide a weakly supervised video behavior detection method and system based on iterative learning, which can accurately locate and detect the actions in the video.
  • the present invention provides the following scheme:
  • a weakly supervised video behavior detection method based on iterative learning including:
  • a neural network model group is constructed, and the neural network model group includes at least two neural network models; the input of each neural network model is the spatiotemporal features of the training set, and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features corresponding to the training-set spatiotemporal features in that neural network model;
  • the first neural network model is trained according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; the first neural network model is the first neural network model in the neural network model group;
  • the next neural network model is trained according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model;
  • the spatiotemporal features of the test set are input into each of the neural network models, and action detection is performed on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model;
  • Action detection is performed on the video to be detected according to the neural network model corresponding to the highest detection accuracy value.
  • training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model is specifically: the classification loss of the video and the similarity loss of the video are calculated according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; and the parameters of the first neural network model are updated according to the classification loss and the similarity loss.
  • training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model is specifically: the classification loss of the video is calculated according to the real category label and the class activation sequence output by the next neural network model; fusion features are calculated according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model; the similarity loss of the video is calculated according to the fusion features and the class activation sequence output by the next neural network model; the temporal loss of the video is calculated according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and the parameters of the next neural network model are updated according to the classification loss, the similarity loss, and the temporal loss.
  • extracting the spatiotemporal features of the video containing action behavior is specifically: extracting the spatiotemporal features of the video containing action behavior with the pre-trained network model I3D.
  • the first neural network model includes a fully connected layer with N nodes, a linear rectification layer, a random deactivation layer, and a fully connected layer with C nodes, where N is the feature dimension of each segment after the video frames in the training-set spatiotemporal features are divided into segments, and C is the total number of categories of all videos in the training set.
  • performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy value is specifically: the spatiotemporal features of the video to be detected are extracted and input into that neural network model to output a class activation sequence; the classification scores of the video to be detected are obtained from the class activation sequence; a predicted category containing the action to be detected is selected according to the classification scores; the activation sequence corresponding to the predicted category is selected from the class activation sequence; and candidate action segments containing the action to be detected are selected according to the activation sequence.
  • a weakly supervised video behavior detection system based on iterative learning including:
  • the spatiotemporal feature extraction module is used to extract the spatiotemporal features of the video containing the action behavior; the spatiotemporal features are divided into training set spatiotemporal features and test set spatiotemporal features;
  • a neural network model group building module is used to construct a neural network model group, the neural network model group includes at least two neural network models; the input of each neural network model is the spatiotemporal feature of the training set, and each The output of the neural network model is the class activation sequence, time series pseudo-label and video feature corresponding to the spatiotemporal feature of the training set in the neural network model;
  • the first training module is used to train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; the first neural network model is the first neural network model in the neural network model group;
  • the iterative training module is used to train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model;
  • the accuracy detection module is used to input the spatiotemporal features of the test set into each of the neural network models and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model;
  • An action detection module configured to perform action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy value.
  • the first training module includes:
  • a loss calculation unit configured to calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model;
  • a first updating unit configured to update parameters of the first neural network model according to the classification loss and the similarity loss.
  • the iterative training module includes:
  • a classification loss calculation unit configured to calculate the classification loss of the video according to the real class label of the video and the class activation sequence output by the next neural network model
  • a fusion feature calculation unit used for calculating fusion features according to the time series pseudo-label output by the current neural network model and the video feature output by the next neural network model;
  • a similarity loss calculation unit configured to calculate the similarity loss of the video according to the fusion feature of the video and the class activation sequence output by the next neural network model
  • a timing loss calculation unit configured to calculate the timing loss of the video according to the timing pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model;
  • the second updating unit is configured to update the parameters of the next neural network model according to the classification loss, the similarity loss and the time series loss.
  • the motion detection module includes:
  • a feature extraction unit for extracting spatiotemporal features of the video to be detected
  • a class activation sequence output unit configured to input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest value of the positioning accuracy, and output a class activation sequence
  • a classification score obtaining unit configured to obtain the classification score of the video to be detected according to the class activation sequence
  • a predicted category selection unit configured to select a predicted category including an action to be detected in the video to be detected according to the classification score
  • an activation sequence selection unit configured to select an activation sequence corresponding to the predicted category in the class activation sequence
  • a candidate action segment selection unit configured to select a candidate action segment including an action to be detected according to the activation sequence.
  • the present invention discloses the following technical effects:
  • the invention provides a weakly supervised video behavior detection method and system based on iterative learning, including: extracting spatiotemporal features of videos containing action behavior; constructing a neural network model group; training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model; inputting the spatiotemporal features of the test set into each neural network model and performing action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy value.
  • the next neural network model is trained according to the time series pseudo-label information output by the current neural network model, which can make the class activation sequence learned by the neural network model more accurate, so that the actions in the video can be accurately detected.
  • FIG. 1 is a flowchart of an iterative learning-based weakly supervised video behavior detection method provided by an embodiment of the present invention
  • FIG. 2 is a process diagram of a weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention
  • FIG. 3 is a process diagram of a fusion feature acquisition provided by an embodiment of the present invention.
  • FIG. 4 is a process diagram of a time series pseudo-label output provided by an embodiment of the present invention.
  • Fig. 5 is a timing loss calculation process diagram provided by an embodiment of the present invention.
  • FIG. 6 is a system block diagram of a weakly supervised video behavior detection system based on iterative learning provided by an embodiment of the present invention.
  • the purpose of the present invention is to provide a weakly supervised video behavior detection method and system based on iterative learning, which can accurately locate and detect the actions in the video.
  • FIG. 1 is a flowchart of an iterative learning-based weakly supervised video behavior detection method provided by an embodiment of the present invention.
  • FIG. 2 is a process diagram of a weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention. As shown in Figures 1 and 2, the method includes:
  • Step 101 Extract spatiotemporal features of videos containing action behaviors.
  • the spatiotemporal features are divided into training set spatiotemporal features and test set spatiotemporal features.
  • in this embodiment, this is specifically: for a given video v, first extract the image frames and optical flow of video v, and then use the I3D model pre-trained on the Kinetics dataset to extract the spatiotemporal features of the video from the image frames and optical flow, where T_v is the number of segments into which all frames of video v are divided, N is the feature dimension of each segment, and N = 2048.
  • Step 102 Construct a neural network model group, the neural network model group includes at least two neural network models; the input of each neural network model is the spatiotemporal features of the training set, and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features corresponding to the training-set spatiotemporal features in that neural network model.
  • Step 103 Train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; the first neural network model is the first neural network model in the neural network model group.
  • the first neural network model includes a fully connected layer with N nodes, a linear rectification layer, a random deactivation layer, and a fully connected layer with C nodes, where N is the video in the spatiotemporal features of the training set The feature dimension of each segment after the frame is divided into segments, C is the total number of categories of all videos in the training set.
  • the process of training the first neural network model includes:
  • Step 1031 Calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model.
  • step 1031 specifically includes the following steps:
  • Step 10311 Input the spatiotemporal features S_v corresponding to video v in the training set into a fully connected layer with 2048 nodes, a linear rectification activation layer, and a random deactivation layer to obtain the video features related to the detection task; input the video features into a fully connected layer with C nodes to obtain the class activation sequence of the video; for the activation sequence of each category, average the top k highest scores to obtain the classification score of the video, and pass the classification scores through a softmax function to obtain the classification probabilities; input the real category label of the video and the classification probabilities into the defined classification loss to obtain the classification loss of the video, where C is the total number of action categories of all videos in the training set, L_class is the classification loss over all videos in the training set, and B is the batch size.
  • Step 10312 According to the class activation sequence of the video, find the activation sequence corresponding to the real category j of the video, and then use the softmax function to obtain class-aware attention weights; using the attention weights, the high-weight feature region H_j of the video features that contains action j and the low-weight feature region L_j that does not contain action j are computed with formula (4) and formula (5).
  • For a video pair (m, n) containing the same action j, the high-weight feature H_j(m) and low-weight feature L_j(m) of video m and the high-weight feature H_j(n) and low-weight feature L_j(n) of video n are calculated using formula (4) and formula (5), respectively.
  • Cosine similarity is used to measure the similarity of two feature vectors, i.e. the similarity D_H[m,n] between H_j(m) and H_j(n), the similarity D_L[m,n] between H_j(m) and L_j(n), and the similarity D_L[n,m] between H_j(n) and L_j(m).
  • Since feature vectors of the same action are similar while action and background feature vectors differ, the hinge loss function is used to enlarge the difference between action and background, which yields the similarity loss of the video, where:
  • L simi is the similarity loss over all videos in the training set.
  • S j is the set of all videos in the training set that contain action j.
  • Step 104 Train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model.
  • step 104 specifically includes the following steps:
  • Step 1041 Calculate the classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model.
  • the calculation process is the same as step 10311.
  • Step 1042 Calculate the fusion feature according to the time series pseudo-label output by the current neural network model and the video feature output by the next neural network model.
  • FIG. 3 is a flowchart of a fusion feature acquisition process provided by an embodiment of the present invention.
  • step 1042 specifically includes the following steps:
  • Step 10421 According to the class activation sequence A_{v,r-1} output by the current neural network model, for the activation vector corresponding to each segment t in video v, pick the highest score as the score that segment t belongs to the action foreground, where r denotes the r-th iteration; when r = 1, A_{v,r-1} is the class activation sequence output by the first neural network model. The action-foreground scores of all segments are passed through a softmax function to obtain class-agnostic weights.
  • Step 10422 Select the segments whose weights rank in the top h as the first action segments; for each action segment among the first action segments, calculate its feature similarity with all segments in the video, where x_m and x_n are the features of the video features output by the current neural network model at the m-th and n-th segments. The segment within two segments before or after each first action segment with the highest similarity is selected as a second action segment, and the positions of the first and second action segments are set to 1 and the remaining positions to 0 to obtain the final temporal pseudo-label.
  • FIG. 4 is a process diagram of a time series pseudo tag output provided by an embodiment of the present invention.
  • Step 10423 Input the temporal pseudo-labels output by the current neural network model into a fully connected layer with 2048 nodes to obtain semantic features that distinguish action and background regions, then combine the semantic features with the video features output by the next neural network model in a certain proportion to obtain the fusion features of the video, where d is the scaling coefficient of the fusion and is set to 0.1.
  • Step 1043 Calculate the similarity loss of the video according to the fusion feature of the video and the next class activation sequence output by the neural network model.
  • The calculation process is the same as step 10312, replacing the video features in step 10312 with the fusion features.
  • Step 1044 Calculate the temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model.
  • FIG. 5 is a flowchart of a timing loss calculation process provided by an embodiment of the present invention. In this embodiment, the specific process is:
  • Class-agnostic weights are obtained from the class activation sequence A_{v,r} output by the next neural network model; the calculation process is the same as step 10421. The temporal loss is then computed between the temporal pseudo-labels G_{v,r-1} output by the current neural network model and the class-agnostic weights.
  • Step 1045 Update the parameters of the next neural network model according to the classification loss, the similarity loss, and the temporal loss; specifically, the parameters are updated with a total loss that combines the classification loss, the similarity loss, and the temporal loss with coefficients γ = 0.5 and β = 0.05.
  • Step 105 Input the spatiotemporal features of the test set into each of the neural network models, and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model.
  • Step 106 Perform motion detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy value. Specifically:
  • Extract the spatiotemporal features of the video to be detected input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest positioning accuracy, output the class activation sequence, and obtain the classification score of the video to be detected according to the class activation sequence.
  • a prediction category including the action to be detected is selected in the video to be detected.
  • a category with a classification score greater than 0 may be selected as the prediction category.
  • a candidate action segment containing the action to be detected is selected as the action detection result.
  • two or more consecutive segments with an activation value greater than a set threshold can be selected as the action detection result, where the threshold is max(A_t) − (max(A_t) − min(A_t)) × 0.5 and A_t is the activation sequence corresponding to the predicted action i.
  • FIG. 6 is a system block diagram of a weakly supervised video behavior detection system based on iterative learning provided by an embodiment of the present invention. As shown in FIG. 6 , the system includes:
  • the spatiotemporal feature extraction module 201 is used to extract the spatiotemporal features of the video including the action behavior; the spatiotemporal features are divided into training set spatiotemporal features and test set spatiotemporal features.
  • the neural network model building module 202 is used to construct a neural network model group, the neural network model group includes at least two neural network models; the input of each neural network model is the spatiotemporal features of the training set, and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features corresponding to the training-set spatiotemporal features in that neural network model.
  • the first training module 203 is used to train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; the first neural network The model is the first neural network model in the neural network model group.
  • the first training module 203 includes:
  • Loss calculation unit 2031 configured to calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model.
  • a first updating unit 2032 configured to update the parameters of the first neural network model according to the classification loss and the similarity loss.
  • the iterative training module 204 is used to train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model.
  • the iterative training module 204 includes:
  • the classification loss calculation unit 2041 is configured to calculate the classification loss of the video according to the real class label of the video and the class activation sequence output by the next neural network model.
  • the fusion feature calculation unit 2042 is configured to calculate fusion features according to the time series pseudo labels output by the current neural network model and the video features output by the next neural network model.
  • the similarity loss calculation unit 2043 is configured to calculate the similarity loss of the video according to the fusion feature of the video and the class activation sequence output by the next neural network model.
  • a timing loss calculation unit 2044 configured to calculate the timing loss of the video according to the timing pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model;
  • the second updating unit 2045 is configured to update the parameters of the next neural network model according to the classification loss, the similarity loss and the timing loss.
  • the accuracy detection module 205 is used to input the spatiotemporal features of the test set into each of the neural network models and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model.
  • the action detection module 206 is configured to perform action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy value.
  • the action detection module 206 includes:
  • the feature extraction unit 2061 is used to extract spatiotemporal features of the video to be detected.
  • a class activation sequence output unit 2062 configured to input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest detection accuracy value, and output a class activation sequence.
  • a classification score obtaining unit 2063 configured to obtain the classification score of the video to be detected according to the class activation sequence.
  • a predicted category selection unit 2064 configured to select a predicted category including an action to be detected in the video to be detected according to the classification score.
  • An activation sequence selection unit 2065 configured to select an activation sequence corresponding to the predicted category from the class activation sequence.
  • the candidate action segment selection unit 2066 is configured to select a candidate action segment including the action to be detected according to the activation sequence.
  • the present invention discloses the following technical effects:
  • the neural network model in the present invention iteratively adds the supervision information of the time series pseudo-label during training, which can make the learned class activation sequence more accurate, thereby making the positioning detection action more accurate.
  • the time series pseudo-labels are converted into semantic features and fused with the video features, so that the video features are more suitable for the positioning task, and the positioning accuracy is further improved.

Abstract

A weakly supervised video behavior detection method and system based on iterative learning, including: extracting spatiotemporal features of videos containing action behavior; constructing a neural network model group; training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model; and performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy. Training the next neural network model with the temporal pseudo-label information output by the current neural network model makes the class activation sequences learned by the neural network models more accurate, so that the actions in the video can be detected accurately.

Description

Weakly supervised video behavior detection method and system based on iterative learning
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 7, 2020, with application number 202010644474.5 and invention title "Weakly supervised video behavior detection method and system based on iterative learning", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the technical field of behavior detection, and in particular to a weakly supervised video behavior detection method and system based on iterative learning.
Background Art
In recent years, behavior recognition has been widely studied in the field of computer vision. Its purpose is to automatically analyze collected videos and identify the categories of the actions in them, replacing the human eye in analyzing and judging actions. Video behavior recognition is widely used in various video scenarios, such as intelligent surveillance, human-computer interaction, motion analysis, and virtual reality. Behavior detection developed from behavior recognition: behavior recognition mainly targets segmented action videos, while behavior detection mainly targets unsegmented action videos and is thus better suited to the long videos shot in real life. The goal of behavior detection is to find the start time and end time of every action in a long unsegmented video and identify the category of each action. Fully supervised behavior detection requires annotation of the precise action times in the video, and manual annotation is not only time-consuming but also varies from person to person. Weakly supervised temporal behavior detection can therefore, knowing only which actions a video contains, locate from which frame each action in the video starts and at which frame it ends, and identify the categories of those actions. Compared with plain behavior recognition and fully supervised temporal behavior detection, weakly supervised temporal behavior detection has broader application prospects and practical value in reality.
At present, most weakly supervised temporal behavior detection methods extract video features with deep convolutional neural networks, use multi-instance learning or an attention mechanism to find the scores in the class activation sequence that respond strongly to actions so as to classify the video's actions, build a localization network from action structure, action features, or the relationship between actions and background to learn and update the class activation sequence, and finally perform localization according to the class activation sequence. These methods still have certain problems: they do not mine the localization information and semantic information latently contained in the class activation sequence, so localization accuracy remains low.
Summary of the Invention
The purpose of the present invention is to provide a weakly supervised video behavior detection method and system based on iterative learning, which can accurately locate and detect the actions in a video.
To achieve the above purpose, the present invention provides the following scheme:
A weakly supervised video behavior detection method based on iterative learning, including:
extracting spatiotemporal features of videos containing action behavior, and dividing the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features;
constructing a neural network model group, the neural network model group containing at least two neural network models, where the input of each neural network model is the training-set spatiotemporal features and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features of the training-set spatiotemporal features in the corresponding neural network model;
training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model, where the first neural network model is the first neural network model in the neural network model group;
training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model;
inputting the test-set spatiotemporal features into each neural network model, and performing action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and
performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy.
Optionally, training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model is specifically:
calculating the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; and
updating the parameters of the first neural network model according to the classification loss and the similarity loss.
Optionally, training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model is specifically:
calculating the classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model;
calculating fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model;
calculating the similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model;
calculating the temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and
updating the parameters of the next neural network model according to the classification loss, the similarity loss, and the temporal loss.
Optionally, extracting the spatiotemporal features of videos containing action behavior is specifically: extracting the spatiotemporal features of the videos containing action behavior with the pre-trained network model I3D.
Optionally, the first neural network model includes a fully connected layer with N nodes, a linear rectification layer, a random deactivation layer, and a fully connected layer with C nodes, where N is the feature dimension of each segment after the video frames in the training-set spatiotemporal features are divided into segments, and C is the total number of categories of all videos in the training set.
Optionally, performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy is specifically:
extracting the spatiotemporal features of the video to be detected;
inputting the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest positioning accuracy, and outputting a class activation sequence;
obtaining the classification scores of the video to be detected according to the class activation sequence;
selecting, according to the classification scores, a predicted category containing the action to be detected in the video to be detected;
selecting the activation sequence corresponding to the predicted category from the class activation sequence; and
selecting candidate action segments containing the action to be detected according to the activation sequence.
A weakly supervised video behavior detection system based on iterative learning, including:
a spatiotemporal feature extraction module, used to extract spatiotemporal features of videos containing action behavior and divide the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features;
a neural network model group construction module, used to construct a neural network model group, the neural network model group containing at least two neural network models, where the input of each neural network model is the training-set spatiotemporal features and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features of the training-set spatiotemporal features in the corresponding neural network model;
a first training module, used to train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model, where the first neural network model is the first neural network model in the neural network model group;
an iterative training module, used to train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model;
an accuracy detection module, used to input the test-set spatiotemporal features into each neural network model and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and
an action detection module, used to perform action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy.
Optionally, the first training module includes:
a loss calculation unit, used to calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; and
a first updating unit, used to update the parameters of the first neural network model according to the classification loss and the similarity loss.
Optionally, the iterative training module includes:
a classification loss calculation unit, used to calculate the classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model;
a fusion feature calculation unit, used to calculate fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model;
a similarity loss calculation unit, used to calculate the similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model;
a temporal loss calculation unit, used to calculate the temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and
a second updating unit, used to update the parameters of the next neural network model according to the classification loss, the similarity loss, and the temporal loss.
Optionally, the action detection module includes:
a feature extraction unit, used to extract the spatiotemporal features of the video to be detected;
a class activation sequence output unit, used to input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest positioning accuracy and output a class activation sequence;
a classification score obtaining unit, used to obtain the classification scores of the video to be detected according to the class activation sequence;
a predicted category selection unit, used to select, according to the classification scores, a predicted category containing the action to be detected in the video to be detected;
an activation sequence selection unit, used to select the activation sequence corresponding to the predicted category from the class activation sequence; and
a candidate action segment selection unit, used to select candidate action segments containing the action to be detected according to the activation sequence.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:
The present invention provides a weakly supervised video behavior detection method and system based on iterative learning, including: extracting spatiotemporal features of videos containing action behavior; constructing a neural network model group; training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model; inputting the test-set spatiotemporal features into each neural network model and performing action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy. In the present invention, the next neural network model is trained with the temporal pseudo-label information output by the current neural network model, which makes the class activation sequences learned by the neural network models more accurate, so that the actions in the video can be detected accurately.
Brief Description of the Drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of the weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention;
FIG. 2 is a process diagram of the weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention;
FIG. 3 is a process diagram of fusion feature acquisition provided by an embodiment of the present invention;
FIG. 4 is a process diagram of temporal pseudo-label output provided by an embodiment of the present invention;
FIG. 5 is a process diagram of temporal loss calculation provided by an embodiment of the present invention;
FIG. 6 is a system block diagram of the weakly supervised video behavior detection system based on iterative learning provided by an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
The purpose of the present invention is to provide a weakly supervised video behavior detection method and system based on iterative learning, which can accurately locate and detect the actions in a video.
To make the above purpose, features, and advantages of the present invention more obvious and understandable, the present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Embodiment 1
FIG. 1 is a flowchart of the weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention. FIG. 2 is a process diagram of the weakly supervised video behavior detection method based on iterative learning provided by an embodiment of the present invention. As shown in FIG. 1 and FIG. 2, the method includes:
Step 101: Extract the spatiotemporal features of videos containing action behavior, and divide the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features. In this embodiment, this is specifically: for a given video v, first extract the image frames and optical flow of video v, and then use the I3D model pre-trained on the Kinetics dataset to extract the spatiotemporal features S_v of the video from the image frames and optical flow, where T_v is the number of segments into which all frames of video v are divided, N is the feature dimension of each segment, and N = 2048.
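By way of non-limiting illustration, the following minimal Python sketch shows how frame-level backbone features could be averaged into T_v segment-level features of dimension N = 2048; the function name, the equal-length segmentation strategy, and the use of NumPy are assumptions of this sketch, and the I3D backbone itself is assumed to be available separately.

```python
import numpy as np

def segment_features(frame_feats: np.ndarray, num_segments: int) -> np.ndarray:
    """Average frame-level features into `num_segments` segment-level features.

    frame_feats: (num_frames, N) array of concatenated RGB and optical-flow
    I3D features; returns the (T_v, N) spatiotemporal feature matrix S_v.
    """
    # Split the frame indices into roughly equal chunks and average each chunk.
    chunks = np.array_split(np.arange(frame_feats.shape[0]), num_segments)
    return np.stack([frame_feats[idx].mean(axis=0) for idx in chunks], axis=0)

# Example: 300 frames of 2048-dimensional features grouped into T_v = 30 segments.
s_v = segment_features(np.random.randn(300, 2048).astype(np.float32), num_segments=30)
print(s_v.shape)  # (30, 2048)
```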
Step 102: Construct a neural network model group, the neural network model group containing at least two neural network models; the input of each neural network model is the training-set spatiotemporal features, and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features of the training-set spatiotemporal features in the corresponding neural network model.
Step 103: Train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; the first neural network model is the first neural network model in the neural network model group.
In this embodiment, the first neural network model includes a fully connected layer with N nodes, a linear rectification layer, a random deactivation layer, and a fully connected layer with C nodes, where N is the feature dimension of each segment after the video frames in the training-set spatiotemporal features are divided into segments and C is the total number of categories of all videos in the training set. The process of training the first neural network model includes:
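A minimal PyTorch sketch of one neural network model in the group (a 2048-node fully connected layer, a linear rectification layer, a random deactivation layer, and a C-node fully connected layer) is given below; the class name, the dropout rate, and the example dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeaklySupervisedDetector(nn.Module):
    """One model of the group: FC(N -> 2048) + ReLU + Dropout + FC(2048 -> C)."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 20, dropout: float = 0.5):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(feat_dim, 2048),  # fully connected layer with 2048 nodes
            nn.ReLU(),                  # linear rectification layer
            nn.Dropout(dropout),        # random deactivation layer
        )
        self.classifier = nn.Linear(2048, num_classes)  # C-node fully connected layer

    def forward(self, s_v: torch.Tensor):
        # s_v: (T_v, N) spatiotemporal features of one video
        x_v = self.embed(s_v)          # task-related video features: (T_v, 2048)
        a_v = self.classifier(x_v)     # class activation sequence: (T_v, C)
        return x_v, a_v

model = WeaklySupervisedDetector()
x_v, a_v = model(torch.randn(30, 2048))
print(x_v.shape, a_v.shape)  # torch.Size([30, 2048]) torch.Size([30, 20])
```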
Step 1031: Calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model. In this embodiment, step 1031 specifically includes the following steps:
Step 10311: Input the spatiotemporal features S_v corresponding to video v in the training set into a fully connected layer with 2048 nodes, a linear rectification activation layer, and a random deactivation layer to obtain the video features related to the detection task. Input the video features into a fully connected layer with C nodes to obtain the class activation sequence of the video. According to the class activation sequence of video v, for the activation sequence corresponding to each category c, average the top k highest scores to obtain the classification score of the video for that category, and pass the classification scores through a softmax function to obtain the classification probabilities, where C is the total number of action categories of all videos in the training set. Input the real category label of the video and the classification probabilities into the defined classification loss to obtain the classification loss of the video, where L_class is the classification loss over all videos in the training set and B is the batch size.
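The following sketch illustrates the video-level classification branch described in step 10311 (top-k averaging of the class activation sequence followed by a softmax over classes); because the exact loss formula appears only as an image in the original, the cross-entropy form, the label normalization, and the ratio used to derive k from T_v are assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(a_v: torch.Tensor, y_v: torch.Tensor, k: int) -> torch.Tensor:
    """Video-level classification loss from a class activation sequence.

    a_v: (T_v, C) class activation sequence of one video.
    y_v: (C,) multi-hot real category label of the video.
    k:   number of top scores averaged per class.
    """
    topk_scores, _ = a_v.topk(k, dim=0)             # top-k activations per class
    s_v = topk_scores.mean(dim=0)                   # classification score per class: (C,)
    p_v = F.softmax(s_v, dim=0)                     # classification probability: (C,)
    y_norm = y_v / y_v.sum().clamp(min=1.0)         # normalized multi-hot label (assumption)
    return -(y_norm * torch.log(p_v + 1e-8)).sum()  # cross-entropy form (assumption)

a_v = torch.randn(30, 20)
y_v = torch.zeros(20)
y_v[3] = 1.0
print(classification_loss(a_v, y_v, k=max(1, 30 // 8)))  # k = T_v // 8 is an assumed ratio
```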
Step 10312: According to the class activation sequence of the video, find the activation sequence corresponding to the real category j of the video, and then use the softmax function to obtain the class-aware attention weights. Using the attention weights, compute from the video features X the high-weight feature region H_j that contains action j and the low-weight feature region L_j that does not contain action j, according to formula (4) and formula (5). For a video pair (m, n) containing the same action j, use formula (4) and formula (5) to compute the high-weight feature H_j(m) and low-weight feature L_j(m) of video m and the high-weight feature H_j(n) and low-weight feature L_j(n) of video n, respectively. Then use cosine similarity to measure the similarity of two feature vectors, i.e. the similarity D_H[m,n] between H_j(m) and H_j(n), the similarity D_L[m,n] between H_j(m) and L_j(n), and the similarity D_L[n,m] between H_j(n) and L_j(m). Since feature vectors of the same action are similar while action and background feature vectors differ, the hinge loss function is used to enlarge the difference between action and background, which yields the similarity loss of the video, where L_simi is the similarity loss over all videos in the training set and S_j is the set of all videos in the training set that contain action j.
Step 1032: Update the parameters of the first neural network model according to the classification loss and the similarity loss. Specifically, the parameters of the first neural network model are updated with the total loss L_0, where L_0 = γ·L_class + (1 − γ)·L_simi and the coefficient γ = 0.5.
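The sketch below illustrates steps 10312 and 1032 for one video pair (m, n) that contains the same action j: attention-weighted high- and low-weight features, the three cosine similarities, and a hinge-style similarity loss. The complementary weighting used for the low-weight region and the hinge margin of 0.5 are assumptions standing in for the formulas that appear only as images in the original.

```python
import torch
import torch.nn.functional as F

def region_features(x: torch.Tensor, a_j: torch.Tensor):
    """High/low-weight features of one video for its true class j.

    x: (T, D) video (or fusion) features, a_j: (T,) activation sequence of class j.
    """
    w = F.softmax(a_j, dim=0)                            # class-aware attention weights
    h = (w.unsqueeze(1) * x).sum(dim=0)                  # high-weight (action) feature H_j
    w_low = (1.0 - w) / (1.0 - w).sum().clamp(min=1e-8)  # complementary weights (assumption)
    l = (w_low.unsqueeze(1) * x).sum(dim=0)              # low-weight (background) feature L_j
    return h, l

def similarity_loss(x_m, a_m_j, x_n, a_n_j, margin: float = 0.5):
    """Hinge loss pulling the action features of two videos with action j together
    while pushing action features away from background features."""
    h_m, l_m = region_features(x_m, a_m_j)
    h_n, l_n = region_features(x_n, a_n_j)
    d_hh = F.cosine_similarity(h_m, h_n, dim=0)          # D_H[m, n]
    d_hl = F.cosine_similarity(h_m, l_n, dim=0)          # D_L[m, n]
    d_lh = F.cosine_similarity(h_n, l_m, dim=0)          # D_L[n, m]
    return (torch.clamp(d_hl - d_hh + margin, min=0)
            + torch.clamp(d_lh - d_hh + margin, min=0)) / 2

x_m, x_n = torch.randn(30, 2048), torch.randn(25, 2048)
print(similarity_loss(x_m, torch.randn(30), x_n, torch.randn(25)))
```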
Step 104: Train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model.
In this embodiment, step 104 specifically includes the following steps:
Step 1041: Calculate the classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model. The calculation process is the same as in step 10311.
Step 1042: Calculate fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model. FIG. 3 is a process diagram of fusion feature acquisition provided by an embodiment of the present invention. In this embodiment, step 1042 specifically includes the following steps:
Step 10421: According to the class activation sequence A_{v,r-1} output by the current neural network model, for the activation vector corresponding to each segment t in video v, take the highest score as the score that segment t belongs to the action foreground, where r = (1, 2, ..., R) is the r-th iteration and R is the total number of iterations; when r = 1, A_{v,r-1} is the class activation sequence output by the first neural network model. Pass the action-foreground scores of all segments through a softmax function to obtain the class-agnostic weights.
Step 10422: Select the segments whose weights rank in the top h as the first action segments. For each action segment among the first action segments, compute its feature similarity with all segments in the video, where x_m and x_n are the features of the video features output by the current neural network model at the m-th and n-th segments. Select the segment that lies within two segments before or after each first action segment in time and has the highest similarity to it as a second action segment; set the positions corresponding to the first action segments and the second action segments to 1 and the remaining positions to 0 to obtain the final temporal pseudo-label, in which the value at segment t is 1 if segment t is an action segment and 0 otherwise.
FIG. 4 is a process diagram of the temporal pseudo-label output provided by an embodiment of the present invention.
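The following sketch shows one way to implement steps 10421 and 10422 (temporal pseudo-label generation). The value of h, which the embodiment defines by a formula shown only as an image, is passed in as a parameter here, and the use of cosine similarity over normalized features is an assumption.

```python
import torch
import torch.nn.functional as F

def temporal_pseudo_labels(a_prev: torch.Tensor, x_prev: torch.Tensor, h: int) -> torch.Tensor:
    """Build binary temporal pseudo-labels G from the current model's outputs.

    a_prev: (T, C) class activation sequence of the current model.
    x_prev: (T, D) video features of the same model.
    h:      number of top-weighted segments kept as first action segments.
    """
    fg_score, _ = a_prev.max(dim=1)                  # foreground score per segment
    w = F.softmax(fg_score, dim=0)                   # class-agnostic weights
    first = torch.topk(w, k=min(h, w.numel())).indices
    g = torch.zeros(a_prev.shape[0])
    g[first] = 1.0                                   # first action segments
    x_norm = F.normalize(x_prev, dim=1)
    sim = x_norm @ x_norm.t()                        # pairwise cosine feature similarity
    for t in first.tolist():                         # expand to the most similar neighbor
        lo, hi = max(0, t - 2), min(g.numel() - 1, t + 2)   # within two segments in time
        neighbors = [s for s in range(lo, hi + 1) if s != t]
        best = max(neighbors, key=lambda s: sim[t, s].item())
        g[best] = 1.0                                # second action segment
    return g                                         # temporal pseudo-label G_{v,r-1}

g = temporal_pseudo_labels(torch.randn(30, 20), torch.randn(30, 2048), h=4)
print(g)
```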
Step 10423: Input the temporal pseudo-label output by the current neural network model into a fully connected layer with 2048 nodes to obtain semantic features that distinguish action and background regions, and then combine the semantic features with the video features output by the next neural network model in a certain proportion to obtain the fusion features of the video, where d is the scaling coefficient of the fusion formula and is set to 0.1.
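A sketch of step 10423 follows: each segment's binary pseudo-label is mapped through a 2048-node fully connected layer to a semantic feature and combined with the next model's video features. The per-segment scalar input encoding, the non-linearity, and the additive combination X + d·F are assumptions about what "combined in a certain proportion" means.

```python
import torch
import torch.nn as nn

class LabelFusion(nn.Module):
    """Turn temporal pseudo-labels into semantic features and fuse them with
    the next model's video features (d = 0.1 as in the embodiment)."""

    def __init__(self, dim: int = 2048, d: float = 0.1):
        super().__init__()
        self.embed = nn.Linear(1, dim)  # 2048-node FC layer applied per segment label
        self.d = d

    def forward(self, g: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
        # g: (T,) binary pseudo-labels, x_next: (T, 2048) video features
        f = torch.relu(self.embed(g.unsqueeze(1)))  # semantic features F: (T, 2048)
        return x_next + self.d * f                  # fused features (assumed additive form)

fusion = LabelFusion()
x_fused = fusion(torch.randint(0, 2, (30,)).float(), torch.randn(30, 2048))
print(x_fused.shape)  # torch.Size([30, 2048])
```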
Step 1043: Calculate the similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model. The calculation process is the same as in step 10312, with the video features in step 10312 replaced by the fusion features.
Step 1044: Calculate the temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model. FIG. 5 is a process diagram of temporal loss calculation provided by an embodiment of the present invention. In this embodiment, the specific process is: obtain the class-agnostic weights from the class activation sequence A_{v,r} output by the next neural network model, with the same calculation process as in step 10421, and then compute the temporal loss between the temporal pseudo-label G_{v,r-1} output by the current neural network model and the class-agnostic weights.
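A sketch of the temporal loss of step 1044 follows. Since the exact formula appears only as an image in the original, a binary cross-entropy between the next model's class-agnostic weights and the current model's pseudo-labels is used here purely as an assumed stand-in, as is the per-video rescaling of the weights.

```python
import torch
import torch.nn.functional as F

def temporal_loss(a_next: torch.Tensor, g_prev: torch.Tensor) -> torch.Tensor:
    """Temporal loss between the next model's class-agnostic weights and the
    current model's binary pseudo-labels (BCE form is an assumption)."""
    fg_score, _ = a_next.max(dim=1)          # same computation as step 10421
    w = F.softmax(fg_score, dim=0)           # class-agnostic weights
    # Rescale the weights to [0, 1] per video before comparing with binary labels.
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)
    return F.binary_cross_entropy(w, g_prev)

print(temporal_loss(torch.randn(30, 20), torch.randint(0, 2, (30,)).float()))
```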
Step 1045: Update the parameters of the next neural network model according to the classification loss, the similarity loss, and the temporal loss. Specifically, the parameters of the next neural network model are updated with the total loss L_r, which combines the classification loss calculated in step 1041, the similarity loss calculated in step 1043, and the temporal loss calculated in step 1044, with coefficient γ = 0.5 and coefficient β = 0.05.
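The following sketch puts the pieces together for one training update of a model with index r ≥ 1 in the group, reusing WeaklySupervisedDetector, classification_loss, similarity_loss, LabelFusion, temporal_loss, and temporal_pseudo_labels from the sketches above. Only γ = 0.5 and β = 0.05 are stated in the embodiment; the weighted sum used for L_r, the optimizer, the learning rate, and the ratios used for k and h are assumptions.

```python
import torch

gamma, beta = 0.5, 0.05
model_r = WeaklySupervisedDetector()
fusion_r = LabelFusion()
optim_r = torch.optim.Adam(list(model_r.parameters()) + list(fusion_r.parameters()), lr=1e-4)

def train_pair(s_m, s_n, y, j, g_m_prev, g_n_prev):
    """One update on a pair of videos (m, n) that both contain action j."""
    x_m, a_m = model_r(s_m)
    x_n, a_n = model_r(s_n)
    k_m, k_n = max(1, s_m.shape[0] // 8), max(1, s_n.shape[0] // 8)   # assumed ratio for k
    l_cls = classification_loss(a_m, y, k_m) + classification_loss(a_n, y, k_n)
    xf_m = fusion_r(g_m_prev, x_m)            # fuse the previous model's pseudo-labels
    xf_n = fusion_r(g_n_prev, x_n)
    l_simi = similarity_loss(xf_m, a_m[:, j], xf_n, a_n[:, j])
    l_time = temporal_loss(a_m, g_m_prev) + temporal_loss(a_n, g_n_prev)
    loss = gamma * l_cls + (1 - gamma) * l_simi + beta * l_time       # assumed form of L_r
    optim_r.zero_grad()
    loss.backward()
    optim_r.step()
    # Pseudo-labels this model will pass on to the next model in the group.
    with torch.no_grad():
        return temporal_pseudo_labels(a_m, x_m, h=max(1, s_m.shape[0] // 8))
```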
Step 105: Input the test-set spatiotemporal features into each neural network model, and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model.
Step 106: Perform action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy. Specifically:
Extract the spatiotemporal features of the video to be detected, input them into the neural network model corresponding to the highest positioning accuracy, output the class activation sequence, and obtain the classification scores of the video to be detected according to the class activation sequence. Select, according to the classification scores, the predicted categories containing the actions to be detected in the video to be detected; in this embodiment, the categories with a classification score greater than 0 may be selected as the predicted categories. Then select the activation sequence corresponding to each predicted category from the class activation sequence. According to the corresponding activation sequence, select candidate action segments containing the action to be detected as the action detection result; in this embodiment, two or more consecutive segments whose activation values are greater than a set threshold may be selected as the action detection result, where the threshold is max(A_t) − (max(A_t) − min(A_t)) × 0.5 and A_t is the activation sequence corresponding to the predicted action i.
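Finally, the sketch below illustrates the detection procedure of step 106 with the WeaklySupervisedDetector sketched earlier: categories with a classification score greater than 0 are kept, and for each kept category, runs of two or more consecutive segments whose activation exceeds max(A_t) − (max(A_t) − min(A_t)) × 0.5 are returned as candidate action segments. The top-k ratio used for the classification score is an assumption.

```python
import torch

def detect_actions(model, s_v: torch.Tensor, k: int):
    """Inference with the selected model: return (class, start, end, score) tuples."""
    with torch.no_grad():
        _, a_v = model(s_v)                          # class activation sequence: (T, C)
    scores = a_v.topk(k, dim=0).values.mean(dim=0)   # video-level classification scores
    detections = []
    for c in (scores > 0).nonzero(as_tuple=True)[0].tolist():   # predicted categories
        a_t = a_v[:, c]
        thr = a_t.max() - (a_t.max() - a_t.min()) * 0.5          # threshold from embodiment
        above = (a_t > thr).tolist()
        start = None
        for t, flag in enumerate(above + [False]):   # sentinel closes the last run
            if flag and start is None:
                start = t
            elif not flag and start is not None:
                if t - start >= 2:                   # keep runs of two or more segments
                    detections.append((c, start, t - 1, scores[c].item()))
                start = None
    return detections

print(detect_actions(WeaklySupervisedDetector(), torch.randn(30, 2048), k=4))
```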
Embodiment 2
The present invention also provides a weakly supervised video behavior detection system based on iterative learning, which applies the weakly supervised video behavior detection method based on iterative learning of Embodiment 1. FIG. 6 is a system block diagram of the weakly supervised video behavior detection system based on iterative learning provided by an embodiment of the present invention. As shown in FIG. 6, the system includes:
a spatiotemporal feature extraction module 201, used to extract the spatiotemporal features of videos containing action behavior and divide the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features;
a neural network model construction module 202, used to construct a neural network model group, the neural network model group containing at least two neural network models, where the input of each neural network model is the training-set spatiotemporal features and the output of each neural network model is the class activation sequence, temporal pseudo-label, and video features of the training-set spatiotemporal features in the corresponding neural network model; and
a first training module 203, used to train the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model, where the first neural network model is the first neural network model in the neural network model group.
In this embodiment, the first training module 203 includes:
a loss calculation unit 2031, used to calculate the classification loss of the video and the similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model, and the video features output by the first neural network model; and
a first updating unit 2032, used to update the parameters of the first neural network model according to the classification loss and the similarity loss.
The system further includes an iterative training module 204, used to train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model, and the video features output by the next neural network model.
In this embodiment, the iterative training module 204 includes:
a classification loss calculation unit 2041, used to calculate the classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model;
a fusion feature calculation unit 2042, used to calculate fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model;
a similarity loss calculation unit 2043, used to calculate the similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model;
a temporal loss calculation unit 2044, used to calculate the temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and
a second updating unit 2045, used to update the parameters of the next neural network model according to the classification loss, the similarity loss, and the temporal loss.
The system further includes an accuracy detection module 205, used to input the test-set spatiotemporal features into each neural network model and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model;
and an action detection module 206, used to perform action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy.
In this embodiment, the action detection module 206 includes:
a feature extraction unit 2061, used to extract the spatiotemporal features of the video to be detected;
a class activation sequence output unit 2062, used to input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest detection accuracy and output a class activation sequence;
a classification score obtaining unit 2063, used to obtain the classification scores of the video to be detected according to the class activation sequence;
a predicted category selection unit 2064, used to select, according to the classification scores, a predicted category containing the action to be detected in the video to be detected;
an activation sequence selection unit 2065, used to select the activation sequence corresponding to the predicted category from the class activation sequence; and
a candidate action segment selection unit 2066, used to select candidate action segments containing the action to be detected according to the activation sequence.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:
(1) The neural network models of the present invention iteratively add the supervision information of the temporal pseudo-labels during training, which can make the learned class activation sequences more accurate, so that localization and detection of actions is more accurate.
(2) In the present invention, the temporal pseudo-labels are converted into semantic features and fused with the video features, which makes the video features more suitable for the localization task and further improves localization accuracy.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and the relevant parts can be found in the description of the method.
Specific examples are used herein to explain the principles and implementations of the present invention. The description of the above embodiments is only used to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be understood as limiting the present invention.

Claims (10)

  1. A weakly supervised video behavior detection method based on iterative learning, characterized in that it comprises:
    extracting spatiotemporal features of a video containing action behavior, and dividing the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features;
    constructing a neural network model group, the neural network model group comprising at least two neural network models, wherein the input of each neural network model is the training-set spatiotemporal features, and the output of each neural network model is the class activation sequence, temporal pseudo-label and video features of the training-set spatiotemporal features in the corresponding neural network model;
    training a first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model and the video features output by the first neural network model, the first neural network model being the first neural network model in the neural network model group;
    training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model and the video features output by the next neural network model;
    inputting the test-set spatiotemporal features into each neural network model, and performing action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and
    performing action detection on a video to be detected according to the neural network model corresponding to the highest detection accuracy.
  2. The detection method according to claim 1, characterized in that training the first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model and the video features output by the first neural network model is specifically:
    calculating a classification loss of the video and a similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model and the video features output by the first neural network model; and
    updating parameters of the first neural network model according to the classification loss and the similarity loss.
  3. The detection method according to claim 1, characterized in that training the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model and the video features output by the next neural network model is specifically:
    calculating a classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model;
    calculating fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model;
    calculating a similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model;
    calculating a temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and
    updating parameters of the next neural network model according to the classification loss, the similarity loss and the temporal loss.
  4. The detection method according to claim 1, characterized in that extracting the spatiotemporal features of the video containing action behavior is specifically: extracting the spatiotemporal features of the video containing action behavior with the pre-trained network model I3D.
  5. The detection method according to claim 1, characterized in that the first neural network model comprises a fully connected layer with N nodes, a linear rectification layer, a random deactivation layer and a fully connected layer with C nodes, wherein N is the feature dimension of each segment after the video frames in the training-set spatiotemporal features are divided into segments, and C is the total number of categories of all videos in the training set.
  6. The detection method according to claim 1, characterized in that performing action detection on the video to be detected according to the neural network model corresponding to the highest detection accuracy is specifically:
    extracting spatiotemporal features of the video to be detected;
    inputting the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest positioning accuracy, and outputting a class activation sequence;
    obtaining classification scores of the video to be detected according to the class activation sequence;
    selecting, according to the classification scores, a predicted category containing the action to be detected in the video to be detected;
    selecting the activation sequence corresponding to the predicted category from the class activation sequence; and
    selecting candidate action segments containing the action to be detected according to the activation sequence.
  7. A weakly supervised video behavior detection system based on iterative learning, characterized in that it comprises:
    a spatiotemporal feature extraction module, configured to extract spatiotemporal features of a video containing action behavior and divide the spatiotemporal features into training-set spatiotemporal features and test-set spatiotemporal features;
    a neural network model group construction module, configured to construct a neural network model group, the neural network model group comprising at least two neural network models, wherein the input of each neural network model is the training-set spatiotemporal features, and the output of each neural network model is the class activation sequence, temporal pseudo-label and video features of the training-set spatiotemporal features in the corresponding neural network model;
    a first training module, configured to train a first neural network model according to the real category label of the video, the class activation sequence output by the first neural network model and the video features output by the first neural network model, the first neural network model being the first neural network model in the neural network model group;
    an iterative training module, configured to train the next neural network model according to the real category label of the video, the temporal pseudo-label output by the current neural network model, the class activation sequence output by the next neural network model and the video features output by the next neural network model;
    an accuracy detection module, configured to input the test-set spatiotemporal features into each neural network model and perform action detection on each corresponding test video in the test set according to the class activation sequence output by each neural network model, to obtain the detection accuracy of each neural network model; and
    an action detection module, configured to perform action detection on a video to be detected according to the neural network model corresponding to the highest detection accuracy.
  8. The detection system according to claim 7, characterized in that the first training module comprises:
    a loss calculation unit, configured to calculate a classification loss of the video and a similarity loss of the video according to the real category label of the video, the class activation sequence output by the first neural network model and the video features output by the first neural network model; and
    a first updating unit, configured to update parameters of the first neural network model according to the classification loss and the similarity loss.
  9. The detection system according to claim 7, characterized in that the iterative training module comprises:
    a classification loss calculation unit, configured to calculate a classification loss of the video according to the real category label of the video and the class activation sequence output by the next neural network model;
    a fusion feature calculation unit, configured to calculate fusion features according to the temporal pseudo-label output by the current neural network model and the video features output by the next neural network model;
    a similarity loss calculation unit, configured to calculate a similarity loss of the video according to the fusion features of the video and the class activation sequence output by the next neural network model;
    a temporal loss calculation unit, configured to calculate a temporal loss of the video according to the temporal pseudo-label output by the current neural network model and the class activation sequence output by the next neural network model; and
    a second updating unit, configured to update parameters of the next neural network model according to the classification loss, the similarity loss and the temporal loss.
  10. The detection system according to claim 7, characterized in that the action detection module comprises:
    a feature extraction unit, configured to extract spatiotemporal features of the video to be detected;
    a class activation sequence output unit, configured to input the spatiotemporal features of the video to be detected into the neural network model corresponding to the highest positioning accuracy and output a class activation sequence;
    a classification score obtaining unit, configured to obtain classification scores of the video to be detected according to the class activation sequence;
    a predicted category selection unit, configured to select, according to the classification scores, a predicted category containing the action to be detected in the video to be detected;
    an activation sequence selection unit, configured to select the activation sequence corresponding to the predicted category from the class activation sequence; and
    a candidate action segment selection unit, configured to select candidate action segments containing the action to be detected according to the activation sequence.
PCT/CN2020/115542 2020-07-07 2020-09-16 Weakly supervised video behavior detection method and system based on iterative learning WO2022007193A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/425,653 US11721130B2 (en) 2020-07-07 2020-09-16 Weakly supervised video activity detection method and system based on iterative learning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010644474.5A CN111797771B (zh) 2020-07-07 2020-07-07 Weakly supervised video behavior detection method and system based on iterative learning
CN202010644474.5 2020-07-07

Publications (1)

Publication Number Publication Date
WO2022007193A1 (zh) 2022-01-13

Family

ID=72810429

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115542 WO2022007193A1 (zh) 2020-07-07 2020-09-16 Weakly supervised video behavior detection method and system based on iterative learning

Country Status (3)

Country Link
US (1) US11721130B2 (zh)
CN (1) CN111797771B (zh)
WO (1) WO2022007193A1 (zh)


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10984246B2 (en) * 2019-03-13 2021-04-20 Google Llc Gating model for video analysis
JP6800453B1 (ja) * 2020-05-07 2020-12-16 株式会社 情報システムエンジニアリング 情報処理装置及び情報処理方法
KR102504321B1 (ko) * 2020-08-25 2023-02-28 한국전자통신연구원 온라인 행동 탐지 장치 및 방법
CN112926492B (zh) * 2021-03-18 2022-08-12 南京理工大学 一种基于单帧监督的时序行为检测方法及系统
CN113420592B (zh) * 2021-05-14 2022-11-18 东南大学 一种基于代理度量模型的弱监督视频行为定位方法
CN116030538B (zh) * 2023-03-30 2023-06-16 中国科学技术大学 弱监督动作检测方法、系统、设备及存储介质


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000073996A1 (en) * 1999-05-28 2000-12-07 Glebe Systems Pty Ltd Method and apparatus for tracking a moving object
JP2009280109A (ja) * 2008-05-22 2009-12-03 Toyota Industries Corp Vehicle surroundings monitoring device
WO2019114982A1 (en) * 2017-12-15 2019-06-20 Nokia Technologies Oy Methods and apparatuses for inferencing using a neural network
CN110287970B * 2019-06-25 2021-07-27 电子科技大学 Weakly supervised object localization method based on CAM and masking
CN111079646B * 2019-12-16 2023-06-06 中山大学 Method and system for weakly supervised temporal action localization in video based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103264A1 (en) * 2014-06-24 2017-04-13 Sportlogiq Inc. System and Method for Visual Event Description and Event Analysis
US20190095716A1 (en) * 2017-09-26 2019-03-28 Ambient AI, Inc Systems and methods for intelligent and interpretive analysis of video image data using machine learning
CN110188654A (zh) * 2019-05-27 2019-08-30 东南大学 Video action recognition method based on a mobile untrimmed network
CN110516536A (zh) * 2019-07-12 2019-11-29 杭州电子科技大学 Weakly supervised video action detection method based on complementary temporal class activation maps
CN111079658A (zh) * 2019-12-19 2020-04-28 夸氪思维(南京)智能技术有限公司 Video-based multi-target continuous behavior analysis method, system, and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861903A (zh) * 2023-02-16 2023-03-28 合肥工业大学智能制造技术研究院 Weakly supervised video anomaly detection method, system, and model training method
CN116612420A (zh) * 2023-07-20 2023-08-18 中国科学技术大学 Weakly supervised temporal action detection method, system, device, and storage medium for video
CN116612420B (zh) * 2023-07-20 2023-11-28 中国科学技术大学 Weakly supervised temporal action detection method, system, device, and storage medium for video

Also Published As

Publication number Publication date
US11721130B2 (en) 2023-08-08
CN111797771A (zh) 2020-10-20
US20220189209A1 (en) 2022-06-16
CN111797771B (zh) 2022-09-09

Similar Documents

Publication Publication Date Title
WO2022007193A1 (zh) 一种基于迭代学习的弱监督视频行为检测方法及系统
CN108985334B (zh) 基于自监督过程改进主动学习的通用物体检测系统及方法
CN110070074B (zh) 一种构建行人检测模型的方法
US20210326638A1 (en) Video panoptic segmentation
CN109299657B (zh) 基于语义注意力保留机制的群体行为识别方法及装置
JP7328444B2 (ja) エンテールメントを用いたキーポイントベースの姿勢追跡
CN108491766B (zh) 一种端到端的基于深度决策森林的人群计数方法
CN109902202B (zh) 一种视频分类方法及装置
Xu et al. Video salient object detection via robust seeds extraction and multi-graphs manifold propagation
Wang et al. Multi-source uncertainty mining for deep unsupervised saliency detection
CN110458022B (zh) 一种基于域适应的可自主学习目标检测方法
JP2022082493A (ja) ノイズチャネルに基づくランダム遮蔽回復の歩行者再識別方法
CN114842553A (zh) 基于残差收缩结构和非局部注意力的行为检测方法
Hammam et al. Real-time multiple spatiotemporal action localization and prediction approach using deep learning
CN112861917A (zh) 基于图像属性学习的弱监督目标检测方法
Liu et al. Multiple people tracking with articulation detection and stitching strategy
CN113283282A (zh) 一种基于时域语义特征的弱监督时序动作检测方法
CN116310293B (zh) 一种基于弱监督学习的生成高质量候选框目标检测方法
CN115294176B (zh) 一种双光多模型长时间目标跟踪方法、系统及存储介质
Fan et al. Video anomaly detection using CycleGan based on skeleton features
CN111915648B (zh) 一种基于常识和记忆网络的长期目标运动跟踪方法
CN113989920A (zh) 一种基于深度学习的运动员行为质量评估方法
Wen et al. Streaming video temporal action segmentation in real time
CN112598056A (zh) 一种基于屏幕监控的软件识别方法
CN112541403A (zh) 一种利用红外摄像头的室内人员跌倒检测方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944271

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20944271

Country of ref document: EP

Kind code of ref document: A1