CN113139467A - Hierarchical structure-based fine-grained video action identification method

Info

Publication number
CN113139467A
CN113139467A (application CN202110444382.7A)
Authority
CN
China
Prior art keywords
video
grained
fine
time sequence
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110444382.7A
Other languages
Chinese (zh)
Other versions
CN113139467B (en)
Inventor
杨旸 (Yang Yang)
杨文涛 (Yang Wentao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202110444382.7A
Publication of CN113139467A
Application granted
Publication of CN113139467B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The hierarchical-structure-based fine-grained video action recognition method performs fine-grained action recognition in video through a two-stage process: the first stage identifies the coarse category of the action in a long-duration video, and on that basis the second stage identifies the fine-grained action. The specific steps are: step one, hierarchical data processing and feature extraction; step two, video representation feature extraction; step three, inter-segment fusion, two-stream fusion, and prediction; step four, fine-grained action feature extraction; and step five, fine-grained action prediction and classification. Applied to fine-grained action classification, the method can effectively recognize and classify fine-grained video actions.

Description

Hierarchical structure-based fine-grained video action identification method
Technical Field
The invention relates to the field of behavior recognition, and in particular to a hierarchical-structure-based fine-grained video action recognition method.
Background
Behavior recognition is a fundamental research problem in computer vision. Its main task is to analyze human behavior in video, generally by classifying the human actions in a given video. Behavior recognition is applied in many areas of life, such as social surveillance, public safety, human-computer interaction, and smart homes. Many behavior recognition algorithms have been proposed, but obtaining better video representations and more detailed fine-grained action recognition remains a challenging task.
Before deep learning entered the field, the best-performing algorithms were the Dense Trajectories method DT (Dense Trajectories) [1] and the improved Dense Trajectories method iDT (improved Dense Trajectories) [2]. The landmark work applying deep learning to behavior recognition is the two-stream network [3], which processes a video into a spatial stream (characterizing the target) and a temporal stream (characterizing the motion) and finally fuses the two streams to obtain the classification result. The TSN (Temporal Segment Networks) [4] network is also based on spatial-temporal two-stream fusion, but it runs multiple networks in parallel and finally performs inter-segment fusion and two-stream fusion. Beyond the two-stream idea, 3D networks have also been applied to behavior recognition. For example, the C3D (Convolutional 3D) network [5] proposed 3D ConvNets trained on large-scale video datasets to learn the spatio-temporal features of video, finding 3 × 3 × 3 to be the optimal convolution kernel size; C3D can model both appearance and motion information. There are also skeleton-based behavior recognition methods, such as recognition with a spatio-temporal graph convolutional network [6]: the algorithm models dynamic skeletons from the time series of human joint positions and extends graph convolution into a spatio-temporal graph convolutional network to capture such spatio-temporal relationships. Because fine-grained actions are highly similar in scene, clothing, and posture, these algorithms transfer poorly to them, and relatively few algorithms target fine-grained action classification.
[1] Heng Wang, Alexander Kläser, Cordelia Schmid, et al. Action Recognition by Dense Trajectories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, Colorado Springs, United States, pp. 3169-3176.
[2] Wang H, Schmid C. Action Recognition with Improved Trajectories. Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), 2013.
[3] Simonyan K, Zisserman A. Two-Stream Convolutional Networks for Action Recognition in Videos. Advances in Neural Information Processing Systems, 2014.
[4] Wang L, Xiong Y, Wang Z, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. European Conference on Computer Vision (ECCV), 2016.
[5] Tran D, Bourdev L, Fergus R, et al. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[6] Yan S, Xiong Y, Lin D. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI Conference on Artificial Intelligence, 2018.
Disclosure of Invention
In order to solve the problems of the prior art in fine-grained behavior recognition, the invention provides a hierarchical-structure-based fine-grained video action recognition method.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the method comprises the steps of firstly, carrying out hierarchical data processing on a long-time-sequence video, extracting a frame of RGB image and extracting optical flow information near the frame for each section after the long-time-sequence video is segmented; secondly, sending a plurality of video frames and optical flow characteristics of the long time sequence video into a plurality of parallel double-flow networks for characteristic extraction, wherein each double-flow network consists of a space flow and a time sequence flow; thirdly, fusing the segments by a plurality of parallel networks, then fusing the spatial stream and the time sequence stream, giving higher weight to the spatial stream during the fusion, and outputting the major category of the video action by the fusion information through a prediction function; fourthly, after the large-class motion recognition is completed, the fine-grained motion obtained by the hierarchical data processing is recognized, and on the basis of the known large-class motion to which the fine-grained motion belongs, one frame of image and interframe optical flow information are extracted from each section of fine-grained motion and input into the double-flow network; fifthly, performing double-flow fusion on double-flow network output, giving higher weight to the time sequence flow during fusion, and performing video fine-grained action identification through a prediction function; the first stage of the two stages includes the first to third steps, and the second stage includes the fourth to fifth steps.
The hierarchical data processing of the long-duration video is implemented as follows: the processing of the original input video is hierarchical. The long-duration video of a complete action is sampled into multi-frame information as the representation of the video, comprising multiple frames and the inter-frame optical flow. The long-duration action video is then divided into several fine-grained action segments, each containing one fine-grained action, and each fine-grained action segment samples one frame of information as the representation of that segment.
Feeding the sampled video frames and optical flow features of the long-duration video into multiple parallel two-stream networks for feature extraction is implemented as follows: the video feature processing structure is a hierarchical two-stage structure, in which the first stage processes the multi-frame RGB images and inter-frame optical flow obtained by sampling the long-duration video, with multiple two-stream networks extracting features in parallel.
Extracting one frame of image and the inter-frame optical flow from each fine-grained action and feeding them into the two-stream network is implemented as follows: the video feature processing structure is a hierarchical two-stage structure, in which the second stage processes the single frame and optical flow sampled from each fine-grained action video, with a single network extracting features.
The inter-segment fusion over the multiple parallel networks, followed by fusion of the spatial and temporal streams with a higher weight given to the spatial stream, is implemented as follows: the two stages use different weights for spatial-temporal fusion. In the first-stage coarse-category recognition, after the multiple parallel networks are fused across segments, spatial features dominate over temporal features for distinguishing coarse categories, so in the weighted fusion of the two streams the spatial stream receives a higher weight than the temporal stream.
The two-stream fusion of the network outputs with a higher weight given to the temporal stream is implemented as follows: the two stages use different weights for spatial-temporal fusion. In the second-stage fine-grained action recognition, because the spatial information of fine-grained actions is very similar, temporal features dominate over spatial features, so in the weighted fusion of the two streams the temporal stream receives a higher weight than the spatial stream.
Compared with the prior art, the invention has the following innovations:
Because fine-grained human actions often share highly similar scenes, clothing, postures, and even motion trajectories, traditional video behavior recognition algorithms perform poorly on fine-grained action classification. The invention provides a novel hierarchical two-stage fine-grained behavior recognition method: on top of hierarchical data processing and a two-stage feature processing structure, the first stage recognizes the coarse category of a fine-grained action from the extracted video features, giving a higher weight to the spatial stream; the second stage completes fine-grained action recognition given the known coarse category, giving a higher weight to the temporal stream. Compared with traditional behavior recognition algorithms applied to fine-grained action recognition, the method achieves better recognition results.
Drawings
Fig. 1 is a flowchart of the two-stream-network fine-grained video action recognition method of the invention.
Fig. 2(a) shows an RGB frame extracted from a video, fig. 2(b) the horizontal component of the current frame's optical flow, and fig. 2(c) the vertical component.
Fig. 3 is a structural diagram of the two-stage hierarchical fine-grained action recognition method.
Fig. 4 is the basic flow of the first-stage coarse-category recognition.
Fig. 5 is the basic flow of the second-stage fine-grained action recognition.
Detailed Description
The invention is described in further detail below with reference to the following figures and embodiments:
As shown in fig. 1, the hierarchical-structure-based fine-grained video action recognition method is implemented as a two-stage process: the first stage identifies the coarse category to which the action in a long-duration video belongs, for example archery; on that basis, the second stage identifies the fine-grained action, for example the bow-drawing phase within the archery action. The specific steps are: step one, performing hierarchical data processing on the long-duration video: after segmenting the video, extracting one RGB frame and the optical flow information near that frame for each segment; step two, feeding the sampled video frames and optical flow features into multiple parallel two-stream networks for feature extraction, each two-stream network consisting of a spatial stream and a temporal stream; step three, fusing across segments over the multiple parallel networks, then fusing the spatial and temporal streams with a higher weight given to the spatial stream, and outputting the coarse category of the video action through a prediction function; step four, after coarse-category recognition is completed, recognizing the fine-grained actions obtained by the hierarchical data processing: given the known coarse category, extracting one frame and the inter-frame optical flow from each fine-grained action segment and feeding them into the two-stream network; and step five, performing two-stream fusion of the network outputs with a higher weight given to the temporal stream, and recognizing the fine-grained video action through a prediction function.
The overall two-stage process is shown in fig. 3. The first stage comprises steps one to three, with the framework shown in fig. 4; the second stage comprises steps four and five, with the framework shown in fig. 5. Each step is described in detail below.
The first step: hierarchical data processing and feature extraction
First, the video data is processed hierarchically: the long-duration video is divided at a fine granularity to obtain the fine-grained action segments that compose it. When sampling feature frames to model the long-duration video, too low a sampling rate prevents the extracted features from covering the information needed for behavior recognition, while too high a sampling rate yields redundant features and raises computational complexity. Therefore, a sparse sampling method is adopted: the video is divided equally by duration into several independent segments. Specifically, a video is divided into K segments, denoted {S1, S2, …, SK}. One RGB frame is randomly sampled from each segment to represent the spatial information of the video, and the optical flow of the sampled frame and its neighboring frames is computed to represent the motion information of the video. Processing all K segments yields the representation of each video segment, denoted {T1, T2, …, TK}, where each element contains the spatial features and temporal motion features of the video. Figs. 2(a), 2(b) and 2(c) show an extracted video representation: fig. 2(a) is the extracted RGB frame, fig. 2(b) the horizontal component of the optical flow, and fig. 2(c) the vertical component.
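A minimal sketch of this sparse sampling scheme, assuming the video is already a frame array; the flow window here only gathers the neighboring frames from which optical flow (e.g. TV-L1) would actually be computed, and all names are illustrative:

```python
import numpy as np

def sparse_sample(video, K, flow_span=5, rng=None):
    # Split a (T, H, W, 3) frame array into K equal-duration segments,
    # draw one RGB frame at random from each, and keep the neighboring
    # frames from which the optical flow would be computed.
    rng = rng or np.random.default_rng()
    T = video.shape[0]
    bounds = np.linspace(0, T, K + 1, dtype=int)
    reps = []
    for k in range(K):
        idx = int(rng.integers(bounds[k], bounds[k + 1]))
        lo, hi = max(0, idx - flow_span), min(T, idx + flow_span + 1)
        reps.append((video[idx], video[lo:hi]))  # (RGB frame, flow window)
    return reps

# toy usage: a 120-frame clip divided into K = 3 segments
video = np.zeros((120, 224, 224, 3), dtype=np.uint8)
reps = sparse_sample(video, K=3)   # the {T1, ..., TK} representations
```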
The second step: video representation feature extraction
The video representations extracted in step 1 are input to multiple parallel two-stream networks, each composed of a temporal-stream branch and a spatial-stream branch. The spatial feature of the video, i.e., the RGB frame, is input to the spatial stream for feature extraction; the temporal feature of the video, i.e., the optical flow, is input to the temporal stream for feature extraction. Concretely, applying a network with parameters w to segment Tk yields the network score, denoted F(Tk, w).
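One such parallel branch can be sketched as a pair of small CNNs. The tiny convolutional stems below are placeholders chosen only to keep the sketch short; the patent's backbone is BN-Inception:

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    # One dual-stream branch: a spatial CNN over a single RGB frame and a
    # temporal CNN over stacked optical-flow fields (two channels, horizontal
    # and vertical, per flow frame).
    def __init__(self, num_classes, flow_len=5):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=3, stride=2, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(16, num_classes))
        self.spatial = stem(3)               # RGB frame
        self.temporal = stem(2 * flow_len)   # stacked (u, v) flow fields

    def forward(self, rgb, flow):
        # per-stream class scores F(Tk, w)
        return self.spatial(rgb), self.temporal(flow)

net = TwoStream(num_classes=10)
s_score, t_score = net(torch.randn(1, 3, 224, 224), torch.randn(1, 10, 224, 224))
```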
The third step: inter-segment fusion, two-stream fusion and prediction
After the features of the multiple video segments are obtained in step 2, an aggregation function is used to fuse the network prediction scores across segments, specifically:
G = G(F(T1, w), F(T2, w), …, F(TK, w))    (1)
where G is the aggregation function across the video segments; its concrete form is an average pooling function, which takes the mean of the network output scores for the same class as the final network score of that class. Meanwhile, the network adopts a variant cross-entropy loss function, defined as:
L(y, G) = - Σ_{i=1}^{C} y_i (G_i - log Σ_{j=1}^{C} exp G_j)    (2)
where y is the ground-truth label, G is the aggregated score vector over the video segments, C is the number of categories, and the subscripts i and j are category indices.
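Equations (1) and (2) translate directly into code. A sketch, with toy random tensors standing in for the per-segment scores F(Tk, w):

```python
import torch

def aggregate(segment_scores):
    # G of eq. (1): average pooling of the per-segment scores F(Tk, w)
    return torch.stack(segment_scores).mean(dim=0)

def segment_ce_loss(G, y):
    # eq. (2): L(y, G) = -sum_i y_i * (G_i - log sum_j exp(G_j))
    log_probs = G - torch.logsumexp(G, dim=-1, keepdim=True)
    return -(y * log_probs).sum(dim=-1).mean()

# toy usage: K = 3 segments, batch of 4 videos, C = 10 classes
scores = [torch.randn(4, 10) for _ in range(3)]
y = torch.nn.functional.one_hot(torch.tensor([1, 3, 5, 7]), num_classes=10).float()
loss = segment_ce_loss(aggregate(scores), y)
```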
After the class prediction scores of the spatial stream and the temporal stream are obtained, two-stream fusion is performed as a weighted average to obtain the prediction score of each coarse category. Since this step performs coarse-category recognition, where distinguishing different coarse categories depends mainly on appearance, the spatial stream is given the higher weight, specifically spatial : temporal = 2 : 1. On the basis of the class prediction scores, a prediction function H performs probability prediction for each category; H takes the form of the standard softmax function.
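The weighted fusion and the prediction function H can be sketched as follows (function and variable names are illustrative):

```python
import torch

def fuse_and_predict(spatial_scores, temporal_scores, w_spatial, w_temporal):
    # weighted-average two-stream fusion followed by the softmax prediction
    # function H; stage 1 uses spatial:temporal = 2:1
    fused = (w_spatial * spatial_scores + w_temporal * temporal_scores) \
            / (w_spatial + w_temporal)
    return torch.softmax(fused, dim=-1)

# toy usage with C = 10 coarse categories
probs = fuse_and_predict(torch.randn(1, 10), torch.randn(1, 10), 2.0, 1.0)
coarse_class = probs.argmax(dim=-1)
```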
The fourth step: fine-grained action feature extraction
Given the coarse category obtained in step 3 to which the fine-grained action belongs, a similar operation is applied to the fine-grained actions obtained by the hierarchical data processing: a single video frame and the inter-frame optical flow are extracted from each fine-grained action segment to represent the spatial target information and the temporal motion information, and are input to the two-stream network for feature extraction. The basic feature extraction network is built from BN-Inception blocks, which accelerates convergence; to suppress overfitting, a dropout operation is introduced. To address the relatively small amount of fine-grained action data, data augmentation is adopted, including random cropping, horizontal flipping, corner cropping, and multi-scale cropping.
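Random cropping and horizontal flipping are standard torchvision transforms; corner cropping and multi-scale cropping can be sketched as below, a TSN-style approximation assuming PIL images, with the scale values chosen only for illustration:

```python
import random
from PIL import Image
import torchvision.transforms as T

def multi_scale_corner_crop(img, scales=(256, 224, 192, 168), out=224):
    # choose a crop size from several scales and a position from the four
    # corners or the center, then resize to the network input size
    size = random.choice(scales)
    w, h = img.size
    positions = [(0, 0), (w - size, 0), (0, h - size),
                 (w - size, h - size), ((w - size) // 2, (h - size) // 2)]
    x, y = random.choice(positions)
    return img.crop((x, y, x + size, y + size)).resize((out, out))

# corner/multi-scale cropping followed by random horizontal flipping;
# plain random cropping is available as T.RandomCrop
augment = T.Compose([T.Lambda(multi_scale_corner_crop), T.RandomHorizontalFlip()])
augmented = augment(Image.new("RGB", (340, 256)))
```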
The fifth step: fine-grained action prediction and classification
After the two-stream network outputs for the fine-grained action are obtained in step 4, the two streams are fused. Considering that, with similar spatial backgrounds and target appearances across fine-grained actions, the motion information in the temporal stream is the key to distinguishing them, the weighted average gives the higher weight to the temporal stream, specifically spatial : temporal = 1 : 2. After the two-stream fusion, the prediction function softmax performs probability prediction and the final fine-grained action category is output.
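In code this mirrors the stage-1 fusion with the weights inverted; the snippet below assumes the fuse_and_predict sketch from the third step and per-stream score tensors for a fine-grained clip:

```python
# temporal-heavy fusion for fine-grained recognition: spatial:temporal = 1:2
fine_probs = fuse_and_predict(spatial_scores, temporal_scores,
                              w_spatial=1.0, w_temporal=2.0)
fine_action = fine_probs.argmax(dim=-1)
```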

Claims (6)

1. A fine-grained video action recognition method based on a hierarchical structure, characterized in that: the fine-grained action recognition consists of two stages, wherein the first stage recognizes the coarse category and the second stage recognizes the fine-grained action on that basis; the method specifically comprises: step one, performing hierarchical data processing on a long-duration video: after segmenting the video, extracting one RGB frame and the optical flow information near that frame for each segment; step two, feeding the sampled video frames and optical flow features of the long-duration video into multiple parallel two-stream networks for feature extraction, each two-stream network consisting of a spatial stream and a temporal stream; step three, fusing across segments over the multiple parallel networks, then fusing the spatial and temporal streams with a higher weight given to the spatial stream, and outputting the coarse category of the video action from the fused information through a prediction function; step four, after coarse-category recognition is completed, recognizing the fine-grained actions obtained by the hierarchical data processing: given the known coarse category, extracting one frame of image and the inter-frame optical flow from each fine-grained action segment and feeding them into the two-stream network; step five, performing two-stream fusion of the network outputs with a higher weight given to the temporal stream, and recognizing the fine-grained video action through a prediction function; of the two stages, the first stage comprises steps one to three and the second stage comprises steps four and five.
2. The fine-grained video action recognition method based on a hierarchical structure according to claim 1, characterized in that: in step one, the hierarchical processing of the long-duration video is specifically: the processing of the original input video is hierarchical; the long-duration video of a complete action is sampled into multi-frame information as the representation of the video, comprising multiple frames and the inter-frame optical flow; the long-duration action video is then divided into several fine-grained action segments, each containing one fine-grained action, and each fine-grained action segment samples one frame of information as the representation of that segment.
3. The fine-grained video action recognition method based on a hierarchical structure according to claim 1, characterized in that: in step two, feeding the sampled video frames and optical flow features of the long-duration video into multiple parallel two-stream networks for feature extraction is specifically: the video feature processing structure is a hierarchical two-stage structure, in which the first stage processes the multi-frame RGB images and inter-frame optical flow obtained by sampling the long-duration video, with multiple two-stream networks extracting features in parallel.
4. The fine-grained video action recognition method based on a hierarchical structure according to claim 1, characterized in that: in step four, extracting one frame of image and the inter-frame optical flow from each fine-grained action and feeding them into the two-stream network is specifically: the video feature processing structure is a hierarchical two-stage structure, in which the second stage processes the single frame and optical flow sampled from each fine-grained action video, with a single network extracting features.
5. The fine-grained video action recognition method based on a hierarchical structure according to claim 1, characterized in that: in step three, the inter-segment fusion over the multiple parallel networks, followed by fusion of the spatial and temporal streams with a higher weight given to the spatial stream, is specifically: the two stages use different weights for spatial-temporal fusion; in the first-stage coarse-category recognition, after the multiple parallel networks are fused across segments, spatial features dominate over temporal features for coarse-category recognition, so in the weighted fusion of the two streams the spatial stream receives a higher weight than the temporal stream.
6. The fine-grained video action recognition method based on a hierarchical structure according to claim 1, characterized in that: in step five, the two-stream fusion of the network outputs with a higher weight given to the temporal stream is specifically: the two stages use different weights for spatial-temporal fusion; in the second-stage fine-grained action recognition, because the spatial information of fine-grained actions is very similar, temporal features dominate over spatial features, so in the weighted fusion of the two streams the temporal stream receives a higher weight than the spatial stream.
CN202110444382.7A 2021-04-23 2021-04-23 Fine granularity video action recognition method based on hierarchical structure Active CN113139467B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110444382.7A CN113139467B (en) 2021-04-23 2021-04-23 Fine granularity video action recognition method based on hierarchical structure


Publications (2)

Publication Number Publication Date
CN113139467A true CN113139467A (en) 2021-07-20
CN113139467B CN113139467B (en) 2023-04-25

Family

ID=76811831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110444382.7A Active CN113139467B (en) 2021-04-23 2021-04-23 Fine granularity video action recognition method based on hierarchical structure

Country Status (1)

Country Link
CN (1) CN113139467B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599789A (en) * 2016-07-29 2017-04-26 北京市商汤科技开发有限公司 Video class identification method and device, data processing device and electronic device
CN107862376A (en) * 2017-10-30 2018-03-30 中山大学 A kind of human body image action identification method based on double-current neutral net
CN108280443A (en) * 2018-02-23 2018-07-13 深圳市唯特视科技有限公司 A kind of action identification method based on deep feature extraction asynchronous fusion network
CN110163127A (en) * 2019-05-07 2019-08-23 国网江西省电力有限公司检修分公司 A kind of video object Activity recognition method from thick to thin
CN110110686A (en) * 2019-05-14 2019-08-09 中国石油大学(华东) Based on the human motion recognition methods for losing double-current convolutional neural networks more
CN112131908A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 Action identification method and device based on double-flow network, storage medium and equipment
US20210081673A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Action recognition with high-order interaction through spatial-temporal object tracking
CN110598654A (en) * 2019-09-18 2019-12-20 合肥工业大学 Multi-granularity cross modal feature fusion pedestrian re-identification method and re-identification system
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOFAN G. et al.: "FINE-GRAINED ACTION RECOGNITION ON A NOVEL BASKETBALL DATASET", ICASSP 2020 *
WANG Qian et al.: "Temporal action localization based on two-stream convolutional neural networks" (基于双流卷积神经网络的时序动作定位), Software Guide (《软件导刊》) *

Also Published As

Publication number Publication date
CN113139467B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
CN109101896B (en) Video behavior identification method based on space-time fusion characteristics and attention mechanism
CN108875624B (en) Face detection method based on multi-scale cascade dense connection neural network
EP3777207B1 (en) Content-specific neural network distribution
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN110688927B (en) Video action detection method based on time sequence convolution modeling
CN110378208B (en) Behavior identification method based on deep residual error network
CN108764148B (en) Multi-region real-time action detection method based on monitoring video
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN112001308B (en) Lightweight behavior identification method adopting video compression technology and skeleton features
Chen et al. Action recognition with temporal scale-invariant deep learning framework
CN109583334B (en) Action recognition method and system based on space-time correlation neural network
Wang et al. Intermediate fused network with multiple timescales for anomaly detection
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN112036379A (en) Skeleton action identification method based on attention time pooling graph convolution
Xu et al. Prediction-cgan: Human action prediction with conditional generative adversarial networks
Wu et al. Dss-net: Dynamic self-supervised network for video anomaly detection
Komagal et al. Real time background subtraction techniques for detection of moving objects in video surveillance system
CN105956604B (en) Action identification method based on two-layer space-time neighborhood characteristics
CN113139467B (en) Fine granularity video action recognition method based on hierarchical structure
Ouyang et al. The comparison and analysis of extracting video key frame
Majhi et al. Temporal pooling in inflated 3dcnn for weakly-supervised video anomaly detection
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
CN109583335B (en) Video human behavior recognition method based on temporal-spatial information fusion
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant