CN108805083B - Single-stage video behavior detection method - Google Patents

Single-stage video behavior detection method

Info

Publication number
CN108805083B
CN108805083B (Application CN201810607804.6A)
Authority
CN
China
Prior art keywords
behavior
training
video
anchoring
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810607804.6A
Other languages
Chinese (zh)
Other versions
CN108805083A (en)
Inventor
王子磊
刘志康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810607804.6A priority Critical patent/CN108805083B/en
Publication of CN108805083A publication Critical patent/CN108805083A/en
Application granted granted Critical
Publication of CN108805083B publication Critical patent/CN108805083B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 — Movements or behaviour, e.g. gesture recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a single-stage video behavior detection method, which comprises the following steps: in the training stage, a multi-scale behavior segment regression network is constructed based on a convolutional neural network; taking training videos and frame-level real behavior labels as input, the network is trained with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model. In the use stage, when a new video is input, a time-dimension sliding window generates input frame sequences of the same length as the training videos, and the trained multi-scale behavior segment regression network model predicts the behavior category and the corresponding time position of each input frame sequence; non-maximum suppression is then applied to the prediction results to generate the final behavior detection results. The method improves both detection performance and detection efficiency.

Description

Single-stage video behavior detection method
Technical Field
The invention relates to the technical field of video behavior detection, in particular to a single-stage video behavior detection method.
Background
In recent years, video capture devices (such as smart phones, digital cameras and surveillance cameras) have become widely available, making it easy for people to shoot videos, and modern communication devices have made the acquisition and transmission of video increasingly convenient, so video has become an important information carrier in modern society. With the growing demand for computer intelligence and the rapid development of pattern recognition, image processing and artificial intelligence technologies, analyzing video content with computer vision techniques has huge practical demand and high commercial value. Human activities are usually the main information content of videos, so detecting human behaviors in videos is of great significance for video understanding. The video human behavior detection task is to detect the category of each human behavior instance contained in an unsegmented long video and to localize the time at which each behavior instance occurs. Since most surveillance videos and web videos are unsegmented long videos, detection in long videos better matches practical requirements.
With the development of deep learning technology, some research results have been achieved in the field of video behavior detection. However, the field is still at an early stage of development: current video behavior detection methods are often not mature enough and commonly suffer from problems such as overly complex models, excessive computational cost and low behavior localization accuracy. To meet the requirements of practical applications, new video behavior detection frameworks and methods are urgently needed.
At present, research on the video behavior detection task is still limited, and the proposed methods generally follow a multi-stage detection framework: in the first stage, candidate time windows with high recall are generated in the video with a proposal (nomination) technique, or discriminative behavior features are produced with an additional feature extraction step; in the next stage, these candidate time windows or behavior features are classified to obtain behavior category predictions. One convolutional-neural-network-based action detection model uses a two-stage approach: a Fast R-CNN network generates windows of interest on video frames and optical flow maps to produce nominations and extract behavior features, and an independent SVM classifier then classifies these features. The patent "a video motion detection method based on convolutional neural network" segments the uncut video with dense multi-scale sliding windows in the first stage, recognizes each window with a convolutional neural network containing a spatio-temporal pyramid layer, and then screens and merges the per-window recognition results in the next stage to obtain the final detected video segments. The paper "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs" proposes a behavior detection method based on segmented three-dimensional convolutional neural networks: one three-dimensional convolutional network first generates sliding-window-based behavior instance nominations, and another three-dimensional convolutional network then classifies these nominations. The paper "Cascaded Boundary Regression for Temporal Action Detection" adopts a two-stage behavior detection framework and regresses the temporal boundaries of behaviors to refine the sliding-window nominations. The paper "Single Shot Temporal Action Detection" proposes a single-shot behavior detector which relies on an independent two-stream neural network (two-stream ConvNets) to extract appearance and motion features.
However, multi-stage methods treat feature extraction, sliding-window nomination and behavior classification as independent processing stages that cannot be trained jointly, which hinders the coordination and joint optimization of the behavior detection model; meanwhile, a large amount of computation is repeated across stages, which hurts the computational efficiency of the algorithm.
Disclosure of Invention
The invention aims to provide a single-stage video behavior detection method which can improve the detection performance and the detection efficiency.
The purpose of the invention is realized by the following technical scheme:
a single-stage video behavior detection method comprises the following steps:
in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; taking training videos and frame-level real behavior labels as input, training the multi-scale behavior segment regression network with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model;
in the using stage, when a new video is input, generating input frame sequences with the same length as the training video through a time-dimension sliding window, and predicting the behavior category and the corresponding time position of each input frame sequence with the trained multi-scale behavior segment regression network model; then processing the prediction results with non-maximum suppression to generate the final behavior detection result.
According to the technical scheme provided by the invention, firstly, the constructed multi-scale behavior segment regression network completely eliminates the temporal nomination stage and the additional feature extraction stage of traditional behavior detection methods; all computation for detecting behavior instances in an untrimmed long video is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, thereby achieving higher detection performance. Secondly, the simplified network structure allows most of the computation to be parallelized, greatly improving the efficiency of behavior detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a single-stage video behavior detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a process of detecting behavior in a video according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an overall structure of a multi-scale behavior segment regression network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of output results on the THUMOS'14 dataset according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems of complex structure, low detection precision, low processing speed and the like of the conventional video behavior detection method, the embodiment of the invention provides a single-stage video behavior detection method; firstly, in order to improve the calculation efficiency, the method encapsulates all calculations into a network, and completes a behavior detection task in a single-stage convolutional neural network. Secondly, in order to improve the behavior detection precision, the method flexibly detects human behaviors of various time lengths on a multi-scale network characteristic diagram by using multi-scale position regression, and outputs behavior time boundaries and behavior categories at the video frame level. Finally, in order to enable joint optimization of various parts of the network, the method processes input video in a single network, so that the whole network can be trained end to end.
As shown in fig. 1, a flowchart of a single-stage video behavior detection method is provided for an embodiment of the present invention, which mainly includes:
1. in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; and taking the training video and the frame-level real behavior labels as input, and training the multi-scale behavior segment regression network by using an end-to-end optimization method of multi-task learning to obtain a trained multi-scale behavior segment regression network model.
In the embodiment of the invention, a single-stage convolutional neural network is used for completing a behavior detection task, and network characteristic graphs with different scales are connected with anchoring behavior examples with different time lengths, so that the network can flexibly detect human behaviors with various time lengths. The method mainly comprises the following parts:
1) Constructing a multi-scale behavior segment regression network based on the convolutional neural network.
In the embodiment of the invention, the constructed multi-scale behavior segment regression network comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module, wherein:
a. The basic generalization module comprises N1 (e.g., N1 = 5) three-dimensional convolution layers (3D convolution layers) and N2 (e.g., N2 = 5) three-dimensional max-pooling layers (3D max-pooling layers) arranged alternately, and is used for feature generalization of the input video sequence and for enlarging the receptive field.
b. The behavior instance anchoring module adopts N3 (e.g., N3 = 4) three-dimensional convolution layers with stride s1 (e.g., s1 = 2) in the time dimension and stride s2 (e.g., s2 = 1) in the spatial dimensions, and is used to associate anchor behavior instances of different time lengths with each cell of the anchor feature map output by each three-dimensional convolution layer of the module.
In the embodiment of the invention, in the behavior instance anchoring module, each anchor feature map defines a basic time scale sk, k ∈ [1, N3]; the sk are regularly distributed over the value range [0, 1]. A set of scale ratios {rd}, d ∈ [1, Dk], is also defined for each anchor feature map, where Dk is the number of scale ratios. The size of each anchor feature map is denoted h × w × t, where h, w and t are its height, width and length; each cell of size h × w × 3 on the anchor feature map is therefore associated with Dk anchor behavior instances, whose time lengths are ld = sk · rd, d ∈ [1, Dk], and whose center position is the center of the cell.
c. The behavior prediction module convolves each cell of the anchor feature map with Dk · (m + 2) convolution kernels of size h × w × 3, and outputs, for the Dk anchor behavior instances associated with that cell, the predicted scores of the m behavior classes and two time-position offsets.
2) End-to-end optimization method for multi-task learning
In the embodiment of the invention, a training video and a frame-level real behavior label are taken as input, a training objective function is combined, parameters in a multi-scale behavior segment regression network are trained through a gradient descent method, and the training process is as follows:
a. Video frames are extracted from the training video at a fixed frame rate (e.g., 10 frames/second) to obtain a training frame picture sequence, and each frame picture is adjusted to a uniform resolution; sub-sequences of equal length are then taken with a sequentially sliding window as input frame sequences, where the length of each sliding window is the maximum number of frames (e.g., 192 frames) allowed by the GPU video memory.
b. And establishing a corresponding relation between the anchoring behavior instance and the real behavior instance in the label by using a positive sample matching strategy.
In the embodiment of the invention, in each training sample, the temporal overlap (Intersection-over-Union, IoU) between each anchor behavior instance and each real behavior instance in the label is calculated; if the overlap exceeds a fixed threshold (e.g., 0.5), the corresponding anchor behavior instance is taken as a positive sample, otherwise as a negative sample. One real label instance can match multiple anchor behavior instances.
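For illustration, a minimal Python sketch of this positive-sample matching strategy is given below; the (start, end) segment representation, the helper names and the 0.5 threshold are assumptions for the example, not the patent's code.

```python
# A minimal sketch (assumptions noted above) of temporal-IoU positive-sample matching.

def temporal_iou(seg_a, seg_b):
    """Temporal Intersection-over-Union of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_anchors(anchor_segments, gt_segments, iou_threshold=0.5):
    """Return, for every anchor, the index of its matched ground truth (or -1 for negatives)."""
    matches = []
    for anchor in anchor_segments:
        ious = [temporal_iou(anchor, gt) for gt in gt_segments]
        best = max(range(len(ious)), key=lambda i: ious[i]) if ious else -1
        # one label instance may match many anchors, but each anchor keeps
        # at most one label instance (its best-overlapping one)
        matches.append(best if best >= 0 and ious[best] > iou_threshold else -1)
    return matches
```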
c. A multitask loss function is adopted as the objective function for training the multi-scale behavior segment regression network; training is performed with a stochastic gradient descent method, and the final multi-scale behavior segment regression network model is generated iteratively.
In the embodiment of the invention, the multitask loss function is the objective function Lloss of network training, which jointly combines the behavior classification loss and the behavior time-position regression loss and is expressed as:

Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

In the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learnable parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively. pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class. ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position.

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
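As an illustration only, the following PyTorch-style sketch assembles a loss of the form described above (softmax classification over all sampled anchors, smooth-L1 regression over positive anchors, and an L2 penalty on the parameters); tensor shapes, variable names and the default weights are assumptions, not values fixed by the patent.

```python
# A minimal sketch of the multitask objective under the stated assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, labels, pred_offsets, target_offsets,
                   pos_mask, params, alpha=1.0, beta=1e-4):
    # cls_logits:    (N, m) class scores per sampled anchor
    # labels:        (N,)   true class index per anchor
    # pred_offsets / target_offsets: (N, 2) start/end time-position offsets
    # pos_mask:      (N,)   bool, True for positive anchors
    # params:        iterable of network parameter tensors (for the L2 term)
    n_cls = max(labels.numel(), 1)
    n_pos = max(int(pos_mask.sum()), 1)

    loss_cls = F.cross_entropy(cls_logits, labels, reduction="sum") / n_cls
    loss_loc = F.smooth_l1_loss(pred_offsets[pos_mask],
                                target_offsets[pos_mask],
                                reduction="sum") / n_pos
    loss_l2 = sum((p ** 2).sum() for p in params)
    return loss_cls + alpha * loss_loc + beta * loss_l2
```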
2. In the using stage, when a new video is input, input frame sequences of the same length as the training videos are generated with a time-dimension sliding window, and the trained multi-scale behavior segment regression network model predicts the behavior category and corresponding time position of each input frame sequence; non-maximum suppression is then applied to the prediction results to generate the final behavior detection results.
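A minimal sketch of the temporal non-maximum suppression step is shown below, assuming per-class detections are given as (start, end, score) triples; the overlap threshold of 0.3 is illustrative and not specified by the patent.

```python
# Greedy temporal NMS: keep the highest-scoring segments, dropping any segment
# whose temporal IoU with an already-kept segment exceeds the threshold.
def temporal_nms(detections, iou_threshold=0.3):
    """detections: list of (start, end, score); returns the kept subset."""
    detections = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    for det in detections:
        suppressed = False
        for k in kept:
            inter = max(0.0, min(det[1], k[1]) - max(det[0], k[0]))
            union = (det[1] - det[0]) + (k[1] - k[0]) - inter
            if union > 0 and inter / union > iou_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append(det)
    return kept
```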
According to the scheme of the embodiment of the invention, the constructed multi-scale behavior segment regression network completely eliminates the temporal nomination stage and the additional feature extraction stage of traditional behavior detection methods; all computation for behavior instance detection in untrimmed videos is completed within a single convolutional neural network, which can be jointly trained and optimized end to end as a whole, thereby achieving higher detection performance. Secondly, the network adopts a fully convolutional form, which simplifies the network structure and allows most computation to be parallelized, greatly improving the efficiency of behavior detection; in particular, with GPU parallel acceleration the detection speed exceeds that of existing methods.
For ease of understanding, the following description is made in conjunction with specific examples.
The entire flow of this example is similar to the previous embodiment, namely: in the training stage, firstly, a multi-scale behavior segment regression network is constructed; the multi-scale behavior segment regression network completes all behavior detection calculations in a single stage, other redundant links are eliminated, meanwhile, the network uses a full convolution mode, and most calculations can be accelerated in parallel, so that the calculation efficiency is greatly improved; then, extracting a video frame picture sequence from the training video, taking the frame sequence and the frame-level real behavior label generated by a sliding window method on the video frame picture sequence as the input of a multi-scale behavior segment regression network, training network parameters by an end-to-end optimization method of multi-task learning, and generating a network model. In the using stage, for a newly input video, generating a sequence with the same length as an input frame sequence in a sliding window mode after extracting a video frame, and inputting the sequence into a trained multi-scale behavior segment regression network for behavior detection; the output results are then processed using non-maximum suppression (NMS) to produce the final behavior detection result.
The specific detection procedure is shown in fig. 2. The input is a sequence R ∈ R^(3×H×W×T) composed of RGB video frames extracted from the video, where T, H and W are the length, height and width of the input sequence respectively and 3 is the number of RGB channels. The input sequence passes in turn through the basic generalization module, the behavior instance anchoring module and the behavior prediction module, which output the classification category and position offset of each anchor behavior instance. The final temporal position of the action is computed from the position offset. t, h and w denote the length, height and width of the anchor feature map respectively, and m denotes the number of behavior classes.
The present implementation uses video data from the THUMOS'14 dataset, a representative human behavior recognition dataset. The THUMOS'14 dataset is provided by (THUMOS Challenge: Action Recognition with a Large Number of Classes. http://crcv.ucf.edu/THUMOS14/, 2014.).
The following description covers three aspects: constructing the multi-scale behavior segment regression network, end-to-end optimization training with multi-task learning, and testing and evaluation.
1. And constructing a multi-scale behavior segment regression network.
As shown in fig. 3, a schematic diagram of the multi-scale behavior segment regression network provided for this example. It mainly comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module, wherein:
the basic generalization module consists of 5 three-dimensional convolution layers (3D convolution) and 5 three-dimensional maximum value pooling layers (3D max-Pooling). The sizes of the three-dimensional convolution kernels are all 3 multiplied by 3, the number of the convolution kernels contained in the first three layers is respectively 64, 128 and 256, and the number of the convolution kernels contained in the remaining two layers is 512. The kernel size of the first 3 three-dimensional pooling layers was set to 2 × 2 × 2, and the kernel size of the remaining three-dimensional pooling layers was set to 1 × 2 × 2. The output characteristic map of the fifth max pooling layer is denoted as F5, and has a height and width of h-w-3 and a time dimension of t-24. And for convolution output of each layer, a linear correction unit ReLU is adopted as an activation function, and nonlinear mapping modeling capability is added to the network.
The behavior instance anchoring module uses three-dimensional convolution layers; each three-dimensional convolution layer in the module is called an anchoring layer, and the output of each layer is called an anchor feature map. In all anchoring layers, the convolution kernel size is set to 3 × 3 × 3 and each layer contains 256 convolution kernels. The anchor feature maps output by the successive anchoring layers are denoted F6, F7, F8 and F9, with sizes (256 × 12 × 3 × 3), (256 × 6 × 3 × 3), (256 × 3 × 3 × 3) and (256 × 1 × 3 × 3) respectively. The output F5 of the last layer of the basic generalization module, of size (256 × 24 × 3 × 3), is also used as an anchor feature map. Each anchor feature map is padded by one unit at both ends of its second (temporal) dimension, so each cell on an anchor feature map has size (256 × 3 × 3 × 3). The basic scales of the anchor feature maps F5, F6, F7, F8 and F9 are set to {0.1, 0.3, 0.5, 0.7, 0.9} in this order. The scale ratios of F5, F6, F7 and F8 are set to {0.8, 1, 1.5}, and those of F9 to {0.7, 0.85, 1}. Each anchor feature map thus has 3 scale ratios, so each cell (256 × 3 × 3 × 3) on an anchor feature map is associated with 3 anchor behavior instances; the length of each anchor behavior instance is the basic scale of its anchor feature map multiplied by the corresponding scale ratio, and its center position is the center of the corresponding cell.
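As a small illustration of how anchor behavior instances follow from these settings, the sketch below enumerates, per anchor feature map, the (center, length) of each anchor instance for a 192-frame input window; the dictionary layout and helper name are assumptions, and lengths are expressed in frames by scaling the fractional values by the window length.

```python
# Anchor enumeration sketch: length = base scale * scale ratio (fraction of the window),
# center = temporal center of the cell.
INPUT_LEN = 192  # frames in one input sequence

anchor_configs = {                      # (base scale, scale ratios, temporal size t)
    "F5": (0.1, (0.8, 1.0, 1.5), 24),
    "F6": (0.3, (0.8, 1.0, 1.5), 12),
    "F7": (0.5, (0.8, 1.0, 1.5), 6),
    "F8": (0.7, (0.8, 1.0, 1.5), 3),
    "F9": (0.9, (0.7, 0.85, 1.0), 1),
}

def build_anchors():
    anchors = []  # (center_frame, length_in_frames) for every cell/ratio pair
    for base, ratios, t in anchor_configs.values():
        for cell in range(t):
            center = (cell + 0.5) / t * INPUT_LEN
            for r in ratios:
                anchors.append((center, base * r * INPUT_LEN))
    return anchors

# len(build_anchors()) == (24 + 12 + 6 + 3 + 1) * 3 == 138 anchors per input window
```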
The behavior prediction module uses three-dimensional convolution to predict the position offsets and behavior class scores for the anchor positions. Each cell is associated with 3 anchor behavior instances. Taking the THUMOS'14 dataset as an example, the scores of 20 behavior classes plus 1 background class and 2 time-position offsets (front and back) must be predicted, so on each cell of F5, F6, F7, F8 and F9, convolution is performed with 3 × (20 + 1 + 2) = 69 convolution kernels of size 3 × 3 × 3; the output at each cell comprises, for the corresponding 3 anchor behavior instances, the scores of the 20 behavior classes and 1 background class and the 2 time-position offsets.
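A minimal PyTorch sketch of such a prediction head is given below, assuming a 256-channel anchor feature map padded by one cell in the time dimension; the variable names and the reshaping of the output are illustrative choices, not the patent's exact layout.

```python
# Prediction head sketch for the THUMOS'14 setting: 3 * (20 + 1 + 2) = 69 kernels of size 3x3x3.
import torch
import torch.nn as nn

num_anchors, num_classes = 3, 20 + 1          # 20 behavior classes + 1 background class
pred_head = nn.Conv3d(256, num_anchors * (num_classes + 2),
                      kernel_size=3, padding=(1, 0, 0))   # pad one cell in time only

feat = torch.randn(1, 256, 24, 3, 3)          # e.g. anchor feature map F5
out = pred_head(feat)                         # (1, 69, 24, 1, 1): 69 values per temporal cell
out = out.view(1, num_anchors, num_classes + 2, 24)
scores, offsets = out[:, :, :num_classes], out[:, :, num_classes:]
```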
2. End-to-end optimization training for multi-task learning.
Because the GPU memory is limited, a complete long video cannot be fed in at once, and the video must be processed into suitable inputs. Therefore, in the training phase, video frames are first extracted from the training videos of the THUMOS'14 dataset at a frame rate of 10 frames/second, and each frame is resized to 96 × 96, i.e., H = W = 96. A window is then slid sequentially over the video frame sequence to produce consecutive clips of T = 192 frames as the input frame sequences.
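A small Python sketch of this input preparation step is shown below, assuming the frames have already been decoded at 10 frames/second and resized to 96 × 96; the zero-padding of a final short window is an assumption, since the patent does not specify how remainders are handled.

```python
# Sliding-window clip generation sketch (frame decoding/resizing omitted).
import numpy as np

def sliding_windows(frames, window=192, stride=192):
    """frames: array of shape (num_frames, 96, 96, 3); yields clips of shape (window, 96, 96, 3)."""
    for start in range(0, len(frames), stride):
        clip = frames[start:start + window]
        if len(clip) < window:                       # pad the last, shorter window with zeros
            pad = np.zeros((window - len(clip),) + clip.shape[1:], clip.dtype)
            clip = np.concatenate([clip, pad], axis=0)
        yield clip
```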
After the input frame sequences are generated, positive and negative samples need to be further determined, because an input frame sequence contains both behavior segments and background segments. The specific method is as follows: for each input frame sequence, the temporal overlap between each anchor behavior instance in the multi-scale behavior segment regression network and the corresponding real label instance is computed; if the overlap exceeds a threshold of 0.5, the anchor instance is taken as a positive sample, otherwise as a negative sample. One real label instance can match multiple anchor behavior instances, but each anchor behavior instance matches at most one real label instance.
Once positive and negative samples are determined, a multitask loss function is adopted as the objective function of network training, and the multi-scale behavior segment regression network parameters are trained by stochastic gradient descent. The multitask loss function is defined as:
Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

In the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learnable parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively. pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class. ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position.

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
3. Testing and evaluation.
After the multi-scale behavior segment regression network has been trained on the THUMOS'14 dataset, its performance is evaluated as follows: on the test video set of the THUMOS'14 dataset, a video frame sequence is extracted for each video at a frame rate of 10 frames/second, and a window is slid with a step size of 192 frames to generate test frame sequences of the same length as the training frame sequences, which are fed into the trained multi-scale behavior segment regression network to obtain the predicted behavior categories and corresponding time positions. The outputs are then processed with non-maximum suppression (NMS) to produce the final behavior detection results. Finally, the behavior detection results are compared with the real behavior labels of the test set to obtain the evaluation results of the network. Fig. 4 shows behavior detection results on the test video set of the THUMOS'14 dataset.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A single-stage video behavior detection method is characterized by comprising the following steps:
in the training stage, constructing a multi-scale behavior segment regression network based on a convolutional neural network; taking a training video and frame-level real behavior labels as input, training the multi-scale behavior segment regression network with a multi-task learning end-to-end optimization method to obtain a trained multi-scale behavior segment regression network model;
in the using stage, when a new video is input, generating input frame sequences with the same length as the training video through a time-dimension sliding window, and predicting the behavior category and the corresponding time position of each input frame sequence with the trained multi-scale behavior segment regression network model; then processing the prediction results with non-maximum suppression to generate the final behavior detection results;
wherein the constructed multi-scale behavior segment regression network comprises a basic generalization module, a behavior instance anchoring module and a behavior prediction module; wherein:
the basic generalization module comprises N1 alternately arranged three-dimensional convolution layers and N2 three-dimensional max-pooling layers, and is used for feature generalization of the input video sequence and for enlarging the receptive field;
the behavior instance anchoring module adopts N3Step length in the layer time dimension is s1Step in spatial dimension is s2The three-dimensional convolution network is used for associating anchoring behavior examples with different time lengths for each cell of the anchoring characteristic diagram output by each three-dimensional convolution layer of the module;
the behavior prediction module is used for using D for each cell of the anchoring characteristic diagramkConvolving (m +2) convolution kernels with the size h multiplied by w multiplied by 3, and outputting D corresponding to the corresponding cellkThe predicted scores and two time position offsets of the anchoring behavior instances for the m behavior classes; wherein, h and w respectively represent the height and width of the anchoring characteristic diagram.
2. The method as claimed in claim 1, wherein the behavior instance anchoring module defines a basic time scale sk, k ∈ [1, N3], for each anchor feature map, and defines a set of scale ratios {rd}, d ∈ [1, Dk], for each anchor feature map, where Dk is the number of scale ratios and equals the number of anchor behavior instances; each anchor feature map has size h × w × t, where t denotes the length of the anchor feature map, and each cell of size h × w × 3 on the anchor feature map is associated with Dk anchor behavior instances, whose time lengths are ld = sk · rd, d ∈ [1, Dk], and whose center position is the center of the cell.
3. The single-stage video behavior detection method according to claim 1,
training parameters in a multi-scale behavior segment regression network by using a gradient descent method by taking a training video and a frame-level real behavior label as input and combining a training objective function, wherein the training process is as follows:
extracting video frames from a training video by using a fixed frame rate, obtaining a training frame picture sequence, and adjusting each frame picture to a uniform resolution; sequentially using subsequences with the same length of sliding windows as input frame sequences, wherein the length of each sliding window is the maximum frame number allowed by a GPU video memory;
establishing a corresponding relation between the anchoring behavior instance and the real behavior instance in the label by using a positive sample matching strategy;
and (3) adopting a multitask loss function as a target function of the multi-scale behavior segment regression network training, training by a random gradient descent method, and generating a final multi-scale behavior segment regression network model by iteration.
4. The method according to claim 3, wherein the establishing the correspondence between the anchor behavior instance and the real behavior instance in the tag by using the positive sample matching policy comprises:
in each training sample, calculating the overlapping degree of each anchoring behavior instance and each label real behavior instance in the time dimension, if the overlapping degree exceeds a fixed threshold, taking the corresponding anchoring behavior instance as a positive sample, and otherwise, taking the corresponding anchoring behavior instance as a negative sample; wherein one tag real instance can match multiple anchor behavior instances.
5. The method as claimed in claim 3, wherein the multitask loss function is the objective function Lloss of network training, which jointly combines the behavior classification loss and the behavior time-position regression loss and is expressed as:

Lloss = (1/Ncls) Σi Lcls(pi, g) + α · (1/Npos) Σi Lloc(ti, ti*) + β · L2(Θ)

in the above formula, L2(Θ) is the L2 regularization loss, Θ denotes the learned parameters of the whole multi-scale behavior segment regression network, α and β are loss trade-off parameters used to control the time-position offset loss and the regularization loss respectively, and Ncls and Npos are the numbers of total training samples and positive samples respectively; pi = (pi^1, …, pi^m) is the prediction vector over the behavior classes at the i-th anchor position, where the superscript j denotes the j-th of the m behavior classes in total; pi^g denotes the predicted score of the true label category at the i-th anchor position, the superscript g indicating that the g-th behavior class is the true label class; ti is the time-position offset of the anchor instance, and ti* is obtained by coordinate transformation between the true label position and the anchor position;

Lcls is the behavior classification loss, set to the multi-class softmax loss:

Lcls(pi, g) = −log( exp(pi^g) / Σj exp(pi^j) )

Lloc is the behavior time-position regression loss, set to the smoothed L1 loss on the time-position offsets.
CN201810607804.6A 2018-06-13 2018-06-13 Single-stage video behavior detection method Active CN108805083B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810607804.6A CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Publications (2)

Publication Number Publication Date
CN108805083A CN108805083A (en) 2018-11-13
CN108805083B true CN108805083B (en) 2022-03-01

Family

ID=64085637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810607804.6A Active CN108805083B (en) 2018-06-13 2018-06-13 Single-stage video behavior detection method

Country Status (1)

Country Link
CN (1) CN108805083B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697434B (en) * 2019-01-07 2021-01-08 腾讯科技(深圳)有限公司 Behavior recognition method and device and storage medium
CN109829398B (en) * 2019-01-16 2020-03-31 北京航空航天大学 Target detection method in video based on three-dimensional convolution network
CN109816023B (en) * 2019-01-29 2022-01-04 北京字节跳动网络技术有限公司 Method and device for generating picture label model
CN110059584B (en) * 2019-03-28 2023-06-02 中山大学 Event naming method combining boundary distribution and correction
CN110059658B (en) * 2019-04-26 2020-11-24 北京理工大学 Remote sensing satellite image multi-temporal change detection method based on three-dimensional convolutional neural network
CN110084202B (en) * 2019-04-29 2023-04-18 东南大学 Video behavior identification method based on efficient three-dimensional convolution
CN110222592B (en) * 2019-05-16 2023-01-17 西安特种设备检验检测院 Construction method of time sequence behavior detection network model based on complementary time sequence behavior proposal generation
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110610194B (en) * 2019-08-13 2022-08-05 清华大学 Data enhancement method for small data video classification task
CN110633645A (en) * 2019-08-19 2019-12-31 同济大学 Video behavior detection method based on enhanced three-stream architecture
CN110659572B (en) * 2019-08-22 2022-08-12 南京理工大学 Video motion detection method based on bidirectional feature pyramid
CN110796069B (en) * 2019-10-28 2021-02-05 广州云从博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium
CN111259779B (en) * 2020-01-13 2023-08-01 南京大学 Video motion detection method based on center point track prediction
CN111259783A (en) * 2020-01-14 2020-06-09 深圳市奥拓电子股份有限公司 Video behavior detection method and system, highlight video playback system and storage medium
CN111325097B (en) * 2020-01-22 2023-04-07 陕西师范大学 Enhanced single-stage decoupled time sequence action positioning method
CN111553238A (en) * 2020-04-23 2020-08-18 北京大学深圳研究生院 Regression classification module and method for time axis positioning of actions
CN111814588B (en) * 2020-06-18 2023-08-01 浙江大华技术股份有限公司 Behavior detection method, related equipment and device
CN111898461B (en) * 2020-07-08 2022-08-30 贵州大学 Time sequence behavior segment generation method
CN111832479B (en) * 2020-07-14 2023-08-01 西安电子科技大学 Video target detection method based on improved self-adaptive anchor point R-CNN
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113505266B (en) * 2021-07-09 2023-09-26 南京邮电大学 Two-stage anchor-based dynamic video abstraction method
CN114339403B (en) * 2021-12-31 2023-03-28 西安交通大学 Video action fragment generation method, system, equipment and readable storage medium
CN114882403B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN116996661B (en) * 2023-09-27 2024-01-05 中国科学技术大学 Three-dimensional video display method, device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164694B (en) * 2013-02-20 2016-06-01 上海交通大学 A kind of human action knows method for distinguishing
CN105740773B (en) * 2016-01-25 2019-02-01 重庆理工大学 Activity recognition method based on deep learning and multi-scale information
CN106407903A (en) * 2016-08-31 2017-02-15 四川瞳知科技有限公司 Multiple dimensioned convolution neural network-based real time human body abnormal behavior identification method
CN107729799A (en) * 2017-06-13 2018-02-23 银江股份有限公司 Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN108133188B (en) * 2017-12-22 2021-12-21 武汉理工大学 Behavior identification method based on motion history image and convolutional neural network

Also Published As

Publication number Publication date
CN108805083A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108805083B (en) Single-stage video behavior detection method
CN111639692B (en) Shadow detection method based on attention mechanism
CN109697434B (en) Behavior recognition method and device and storage medium
CN109697435B (en) People flow monitoring method and device, storage medium and equipment
CN108921051B (en) Pedestrian attribute identification network and technology based on cyclic neural network attention model
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN112184752A (en) Video target tracking method based on pyramid convolution
CN107633226B (en) Human body motion tracking feature processing method
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN111259779B (en) Video motion detection method based on center point track prediction
US20150235079A1 (en) Learning device, learning method, and program
Hu Design and implementation of abnormal behavior detection based on deep intelligent analysis algorithms in massive video surveillance
Chen et al. Learning linear regression via single-convolutional layer for visual object tracking
CN109002755B (en) Age estimation model construction method and estimation method based on face image
CN111414875B (en) Three-dimensional point cloud head posture estimation system based on depth regression forest
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN111639540A (en) Semi-supervised character re-recognition method based on camera style and human body posture adaptation
Shafiee et al. Embedded motion detection via neural response mixture background modeling
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114492634B (en) Fine granularity equipment picture classification and identification method and system
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN111291785A (en) Target detection method, device, equipment and storage medium
Nooruddin et al. A multi-resolution fusion approach for human activity recognition from video data in tiny edge devices
Hasan et al. Tiny head pose classification by bodily cues

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant