CN116311495A - Dual-stream global-local action recognition method, system, equipment and storage medium based on video input - Google Patents

Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Info

Publication number
CN116311495A
CN116311495A (application CN202310070774.0A)
Authority
CN
China
Prior art keywords
local
global
video
recognition
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310070774.0A
Other languages
Chinese (zh)
Inventor
苗启广
梁思宇
李宇楠
陈绘州
史媛媛
刘如意
盛立杰
刘向增
谢琨
卢子祥
宋建锋
刘林润佳
权义宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310070774.0A
Publication of CN116311495A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A dual-stream global-local action recognition method, system, device and storage medium based on video input. The method comprises the following steps: collecting a video input containing the behavior of a single recognition object, extracting feature key points of the recognition object with a key point recognition method, selecting key points, and cropping the behavior video frame by frame to obtain local images of the corresponding regions; performing data preprocessing on the global video input and on a plurality of local video inputs composed of the local images of the recognition object; training a local feature extraction module and a global feature extraction module by feeding the local videos and the original video into the network respectively; adding a local feature enhancement module and a result fusion structure and training them cooperatively to obtain a global-local action recognition model; and performing action recognition. The system, device and medium implement the dual-stream global-local action recognition method with multi-local extraction. The method is simple to operate and improves the action recognition prediction results of the overall dual-stream global-local action recognition method.

Description

Dual-stream global-local action recognition method, system, equipment and storage medium based on video input
Technical Field
The invention belongs to the technical field of video processing and understanding, and particularly relates to a double-flow global-local action recognition method, system, equipment and storage medium based on video input.
Background
Human behavior recognition technology mainly comprises three aspects: human target recognition, human tracking, and behavior recognition, of which behavior recognition is the higher-level computer vision task built on the first two. Research into robust behavior recognition algorithms has important theoretical significance and broad application prospects, including intelligent video surveillance and video retrieval. To reduce interference from redundant background information and learn the dynamic information of the human body in a video, many methods perform recognition by fusing information from multiple modalities. Global-local methods crop out human body parts in the video for behavior recognition. StNet concatenates N consecutive frames in the RGB channel dimension as a global representation of the video, called a super-image; it obtains local spatio-temporal features from the super-image, then combines these local spatio-temporal features and performs feature extraction in the time dimension to obtain global spatio-temporal features. Attention-based methods use an attention mechanism to emphasize local information in the video as a branch and fuse it with the global network at the softmax layer to combine global and local features.
Patent application CN113761992A discloses a video action recognition method comprising: acquiring a video; processing the video through the hidden layers of a neural network model to obtain the recognition objects in the video and the motions corresponding to them, where the hidden layers comprise a plurality of processing units; and outputting the action recognition result of the video based on the recognition objects and their corresponding motions. At least one of the processing units sequentially extracts the spatial and temporal features of the input video, combines them, performs point-wise convolution, and outputs the spatial and temporal semantic information of the video. According to that technical scheme, the video processing has stronger spatio-temporal relation encoding capability, can extract more meaningful features with fewer parameters, and can learn more useful information from the dataset with a more compact structure; when implementing video processing, only a single processor is required to process the amount of video previously handled by multiple processors.
Existing behavior recognition methods process a single video input and obtain a prediction from the extracted temporal and spatial features, without deeply mining the more detailed spatial feature information in the video images. In addition, some methods use multi-modal fusion, characterizing global features with dual-stream or even multi-stream models of identical structure processed in parallel, and neglect a large amount of fine-grained local information. Methods that use local cropping to emphasize local regions of the recognition object obtain dynamic information, but ignore global information and some useful fine-grained information. Global-local methods based on image stitching and attention mechanisms emphasize local information, but the local fine-grained information has already been lost during preprocessing, before feature extraction.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a dual-stream global-local action recognition method, system, device and storage medium based on video input, which learn the global information of object actions in the video while attending to local fine-grained feature information: local detail information of the recognition object is obtained from the local video inputs produced by the local cropping operation, and combining different local features yields more of the video's dynamic local fine-grained information and thus better global features.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a double-flow global-local action recognition method based on video input specifically comprises the following steps:
step 1, acquiring video input with single recognition object action, extracting recognition object key points in the video by using a recognition object key point recognition method, selecting the extracted recognition object key points, and cutting according to the key point positions to obtain a plurality of recognition object partial images; performing data preprocessing on a plurality of local video inputs and global video inputs formed by a plurality of identification object local images;
step 2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
step 3, adding a local feature enhancement module, processing an intermediate output feature map containing local information in a local network to obtain an attention guiding mask, and enhancing an intermediate result of the global network in a space dimension and a time dimension by using the attention guiding mask;
and 4, cooperatively training the double-flow network and the enhancement module to obtain a global-local action recognition model, and performing action recognition.
The specific method of the step 1 is as follows:
1.1) Using a key point recognition network, a plurality of key points of the recognition object are recognized from the acquired video of the single recognition object's action, and local images centered on the key points are cropped out according to the recognized recognition object key points;
1.2) The recognition object key point data are denoted J_i = (x_i, y_i), i = 1...clip_size, where J_i represents the position of a key point of the recognition object in one image. The local image centered on a key point is defined as a square crop_box with side length box_len, so that each frame yields clip_size local images of shape box_len × box_len centered on the key points J_i:
crop_box = ((x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2))
From each frame, G partial images I_local can be obtained:
I_local = P(crop(I_k, crop_box))
where the crop function cuts out of the input image I_k the region covered by crop_box, and the function P(·) denotes the data preprocessing operation. The local video input Input_local is composed of these local images: one sequence of preprocessed local images per key-point region, across all sampled frames.
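For illustration only, the per-frame cropping of step 1.2 can be sketched as follows; this is a minimal sketch under stated assumptions (the function name extract_local_images and the NumPy array layout are illustrative, not the patent's implementation):
```python
# Illustrative sketch of step 1.2: crop one square local image per key point from a frame.
# extract_local_images and the array layout are assumed names/conventions, not the patent's code.
import numpy as np

def extract_local_images(frame: np.ndarray, keypoints: np.ndarray, box_len: int):
    """frame: (H, W, 3) image; keypoints: (G, 2) array of (x_i, y_i) key-point positions."""
    h, w = frame.shape[:2]
    half = box_len // 2
    crops = []
    for x, y in keypoints.astype(int):
        # crop_box = (x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2)
        x0, y0 = max(x - half, 0), max(y - half, 0)
        x1, y1 = min(x + half, w), min(y + half, h)
        crops.append(frame[y0:y1, x0:x1])   # later resized/normalized by the preprocessing P(.)
    return crops                            # G local images for this frame
```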
The specific method of the step 2 is as follows:
2.1 The global feature extraction network extracts action features in the global video to obtain a prediction result;
2.2 A local feature extraction network extracts action features in the local video;
2.2.1) In the preprocessed local video frames, video sequences are divided with the key points of G different parts of the recognition object as centers;
2.2.2) The result of step 2.2.1) is used as input data for local-network feature extraction: the input data are split into G groups of local data, network features are extracted for each group separately, and the groups are then merged; that is, the input data are treated as a combination of G groups of local data, and the local modules process the G groups respectively;
2.2.3) After the network feature extraction of step 2.2.2) is finished, the data are again divided into G groups, a prediction result is obtained for each group, and the results are averaged to obtain the final prediction result.
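The split-predict-average behaviour of steps 2.2.2) and 2.2.3) can be sketched roughly as below; the class name LocalHead and the feature/class dimensions (2048 features, 400 classes) are illustrative assumptions, not the patent's exact layers:
```python
# Illustrative sketch of steps 2.2.2)-2.2.3): split pooled local features into G groups,
# predict per group, then average. Dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class LocalHead(nn.Module):
    def __init__(self, feat_dim: int = 2048, groups: int = 8, num_classes: int = 400):
        super().__init__()
        self.groups = groups
        self.fc = nn.Linear(feat_dim // groups, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) pooled output of the local backbone
        b = feat.size(0)
        per_group = feat.view(b, self.groups, -1)   # G groups of local features
        logits = self.fc(per_group)                 # one prediction per local region
        return logits.mean(dim=1)                   # average of the G predictions
```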
The specific method of the step 3 is as follows:
3.1) Acquire local average features. Let the intermediate feature map output by the global network be Full_C1_out and the intermediate feature map output by the local feature extraction network be L2_out. First, the output feature map of the local feature extraction network is processed to obtain local average features: the features are divided into clip_size groups in the filter dimension, and the feature maps within each group are averaged to obtain a local average feature map.
3.2) Time-series alignment. Because down-sampling is performed during feature extraction, the feature L2_out is compressed to T_2 in the time dimension and the feature Full_L1_out is compressed to T_1 in the time dimension. The frame_size dimension is replicated in the time dimension to obtain L2_out_3, so that the time dimension of the generated attention-guiding mask is consistent with the target feature map of the guided full-frame stream and the clip_size local average features are aligned in time sequence.
3.3) Create an attention-guiding mask: a spatio-temporal attention-guiding mask is established using L2_out_3. First, an empty mask Mask_empty with the same temporal and spatial dimensions as Full_C1_out is established. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the image data of the local video sequence were cut out. Then the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame according to the relative positions of the local regions within the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, an attention-guiding mask Mask_attention with the same shape as the feature map of the target full-frame stream is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
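A rough sketch of the three sub-steps above is given below; build_guide_mask, its argument layout and the nearest/bilinear resampling choices are illustrative assumptions used only to make the idea concrete, not the patent's exact procedure:
```python
# Illustrative sketch of step 3: group-average the local feature maps, align them in time
# with the guided global feature map, and paste each local average at its crop position.
import torch
import torch.nn.functional as F

def build_guide_mask(l2_out, boxes, global_shape, groups=8):
    """
    l2_out:       (B, C, T2, 16, 16) intermediate local-stream feature maps
    boxes:        (T1, groups, 4) per-frame crop boxes (x0, y0, x1, y1) in the global map's scale
    global_shape: (T1, H1, W1) time/space shape of the guided global feature map
    """
    b, c, t2, h, w = l2_out.shape
    t1, h1, w1 = global_shape
    # 1) local average features: one map per group, averaged over the group's filters
    avg = l2_out.view(b, groups, c // groups, t2, h, w).mean(dim=2)        # (B, G, T2, 16, 16)
    # 2) time-series alignment: bring T2 up to the guided stream's T1
    avg = F.interpolate(avg, size=(t1, h, w), mode="nearest")              # (B, G, T1, 16, 16)
    # 3) paste each group's average at its (frame-dependent) crop position
    mask = torch.zeros(b, 1, t1, h1, w1)
    for t in range(t1):
        for g in range(groups):
            x0, y0, x1, y1 = [int(v) for v in boxes[t][g]]
            patch = F.interpolate(avg[:, g, t][:, None], size=(y1 - y0, x1 - x0),
                                  mode="bilinear", align_corners=False)
            mask[:, :, t, y0:y1, x0:x1] = patch
    return mask   # attention-guiding mask, broadcastable onto the global feature map
```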
The specific method of the step 4 is as follows:
after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
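As a rough illustration of step 4, the guided feature and one possible co-training objective are sketched below; the summed cross-entropy loss is an assumption made for illustration, since the exact training loss is not specified here:
```python
# Illustrative sketch of step 4: apply the attention-guiding mask to the global stream and
# train both streams jointly. The loss below is an assumed example, not the patent's loss.
import torch
import torch.nn.functional as F

def guided_global_features(full_c1_out: torch.Tensor, mask_attention: torch.Tensor) -> torch.Tensor:
    # element-wise multiplication: Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
    return full_c1_out * mask_attention

def cotraining_loss(global_logits, local_logits, labels):
    # cooperative training of the global and local feature extraction modules (assumed form)
    return F.cross_entropy(global_logits, labels) + F.cross_entropy(local_logits, labels)
```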
A dual-stream global-local action recognition system based on video input comprises a recognition object key point feature extraction module, a data preprocessing module and a recognition module;
the recognition object key point feature extraction module is used for extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping out a plurality of recognition object local regions according to the key point positions as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module; the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the method described above.
A dual stream global-local motion recognition device based on video input, comprising:
a memory for storing a computer program;
and the processor is used for realizing the double-flow global-local action recognition method based on the video multi-local extraction in the steps 1 to 4 when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the multi-local extraction dual stream global-local action recognition method of steps 1-4.
Compared with the prior art, the invention has the following advantages and beneficial effects:
firstly, the method and the device can automatically position a plurality of selected key parts of the identification object in each frame of image in the input video by combining the local interception operation, do not need to manually carry out additional identification object key part labeling work, reduce the requirement on input data, and only need to input the original video. The local video image of the identification object obtained by using the local interception operation can better represent the local details of the identification object compared with the video image fed into the global network.
Secondly, in the dual-stream network, the local feature extraction network processes the several preprocessed local input video images of the recognition object, which reduces the time and resource consumption of feature extraction in video recognition; the local video images obtained by the local cropping operation are predicted after feature extraction, which makes use of more local detail visual information and yields better prediction results;
thirdly, the local feature enhancement module added in the global-local double-flow network of the invention gathers the features of a plurality of key parts of the identification object in the local feature extraction module in the space-time field again to obtain the attention guiding mask. By adding the module, the capacity of the global feature extraction module can be improved, so that the effect of the action recognition prediction result obtained by the overall double-flow global-local action recognition method is improved.
Drawings
Fig. 1 is a flowchart of a dual-stream global-local motion recognition method based on video multi-local extraction according to an embodiment of the invention.
Fig. 2 is a comparison of a global feature extraction network and a local feature extraction network model according to an embodiment of the present invention.
Fig. 3 is a comparison of bottleneck structures used by the global feature extraction network and the local feature extraction network model according to an embodiment of the present invention, fig. 3 (a) is a global ResNeXt bottleneck structure, and fig. 3 (b) is a local bottleneck structure.
Fig. 4 is a flowchart of a method for identifying a dual-flow global local action based on video multi-local extraction according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a system according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
Examples:
as shown in fig. 1, taking a human action behavior as an example, the present embodiment provides a dual-stream global-local action recognition method based on video input, including the following steps:
s1, acquiring video input with single human body action content, extracting human body key points in the video by using a human body key point identification method, selecting the extracted human body key points, and cutting according to the key point positions to obtain a plurality of human body partial images; performing data preprocessing on a plurality of local video inputs and global video inputs formed by a plurality of human body local images;
more specifically:
extracting frame_size frames from the acquired video of single human body action as identification objects, and extracting 7 human body key points of a person in the video by using an HRNet network as a key point identifier: nose tip, left and right shoulders, left and right elbows, and the distal ends of the middle fingers of the left and right hands. The size of the side length box_len of the clipping box is set to be the average value of 1.5 times of the distance from the nose tip of the human body to the left shoulder in all frames of the video, and a square image with the side length box_len is obtained by processing the square image with the key point as the center. In order to meet the setting of 8 partial images input by the partial feature extraction module, the facial image with the tip of the nose as the center is duplicated. And finally obtaining the position information of 8 human body partial images to be intercepted.
And identifying a plurality of human body key points from the video of the single human body action obtained from the acquisition input by using a key point identification network, and cutting out a local image taking the key points as the center according to the identified human body key points. Human body key point data is set as J i =(x i ,y i ),i=1...clip size 。J i Representing the location of a human keypoint in an image. Setting a local image with a key point as a center as a box with a side length len Square crop of (2) box Obtaining clip of each frame size Based on the key points J of human body i The shape of the center is a partial image of the center of the key point of box_len×box_len.
crop box =(x i -box len /2,y i -box len /2),(x i +box len /2,y i +box len /2)
From each frame, G partial images i_local can be obtained:
I_local=P(crop(I k ,crop_box)),I local ∈R 3×128×128
the crop function represents a portion where the crop_box is located, and the function P (·) represents a data preprocessing operation, specifically, adjusting the shape size of the input image to 3×box_len×box_len. The definition of the local video Input input_local formalization of the local image composition is as follows:
Figure BDA0004064746450000101
inputted into
Figure BDA0004064746450000102
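The crop-box sizing just described (box_len equal to 1.5 times the average nose-tip-to-left-shoulder distance, with the nose-tip crop duplicated so that 7 key points yield 8 local images) can be sketched as follows; the function names and argument layout are illustrative assumptions:
```python
# Illustrative sketch of the embodiment's crop-box sizing and 8-local selection.
# compute_box_len / eight_crop_centers are assumed helper names, not the patent's code.
import numpy as np

def compute_box_len(nose_xy: np.ndarray, left_shoulder_xy: np.ndarray) -> float:
    """nose_xy, left_shoulder_xy: (frame_size, 2) per-frame key-point coordinates."""
    dist = np.linalg.norm(nose_xy - left_shoulder_xy, axis=1)
    return 1.5 * float(dist.mean())          # e.g. 1.5 * 107 px ≈ 160 px in the fig. 4 example

def eight_crop_centers(keypoints_7: np.ndarray) -> np.ndarray:
    """keypoints_7: (7, 2) = nose tip, shoulders, elbows, middle-finger tips -> (8, 2) centers."""
    return np.concatenate([keypoints_7[:1], keypoints_7], axis=0)   # duplicate the face crop
```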
As shown in fig. 4, take a frame of shape 512×512 from the input video as an example, where the average nose-tip-to-left-shoulder distance in the video is 107 pixels; a 160×160 region representing the person's facial video information is cropped from the frame, and the local regions corresponding to the other 7 predefined human key points are obtained in the same way. Because the parameter count of the model must be controlled, the input to the ResNeXt network is preprocessed into picture sets of shape 128×128. A simple calculation shows that, in the global feature extraction stream, only about 25% of the frame's local resolution is retained after data preprocessing (128/512), whereas the local feature extraction network learns picture data that retain about 80% of the local detail information (128/160). The difference in the amount of information contained in the full-frame input and the local input can also be seen intuitively in fig. 4.
S2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
more specifically:
the global feature extraction network uses a ResNeXt network to extract structures in the global video that include one convolutional layer and four bottleneck blocks containing residual structures, as shown in fig. 3.
The local feature extraction network takes as input video sequences segmented from the original video data, each group of data being a video sequence centered on a human key point. The local feature extraction network is designed as shown in fig. 3: the Conv1 layer of ResNeXt is set as a group convolution layer with group number G, and the 1×1 and 3×3 convolutions in the bottleneck structure are likewise set as group convolution layers with group number G. As can be seen from fig. 3, this design ensures that the model learns the different groups of local features separately during training. The result of Layer4 passes through an AdaptiveAvgPool3d layer to obtain a 2048-dimensional prediction vector, which represents the recognition results of the clip_size local regions; it is split into clip_size × (2048/clip_size), fed into the fc layer to obtain the prediction result of each region, and the clip_size group results are finally averaged. Specific parameter information of the global and local feature extraction networks is shown in fig. 2.
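To make the grouped design concrete, a hedged sketch of a local bottleneck block in the spirit of fig. 3(b) is given below; LocalBottleneck3D, its channel sizes and normalization layers are illustrative assumptions rather than the patent's exact layers (channel counts must be divisible by the group number):
```python
# Illustrative sketch of a grouped ("local") bottleneck: every convolution uses groups=G,
# so the G local streams are processed separately and never mix inside the block.
import torch.nn as nn

class LocalBottleneck3D(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, groups=groups, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, groups=groups, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1, groups=groups, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))   # residual connection kept per group
```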
S3, adding a local feature enhancement module and a result fusion structure to cooperatively train and obtain a global-local action recognition model, and using the model to perform action recognition.
More specifically:
The output of the global network's Conv1 is enhanced in the spatial and temporal dimensions using an intermediate feature map of the local network that contains local information. As shown in the figure, the local feature enhancement module is divided into three parts: 1. acquiring local average features; 2. time-series alignment; 3. establishing an attention-guiding mask.
The Conv1 output of the global network is the feature Full_C1_out, and the layer2 output of the ResNeXt-Local network is the feature L2_out.
First, the features are divided into clip_size groups in the filter dimension, and the 512/clip_size feature maps of each group are averaged to obtain a local average feature map.
Time-series alignment is then required. Because down-sampling is performed during feature extraction, the feature L2_out is compressed from frame_size in the time dimension to T_2, and the feature Full_L1_out is compressed in the time dimension to T_1. To make the time dimension of the generated attention-guiding mask coincide with the target feature map of the guided full-frame stream, the frame_size dimension is replicated in the time dimension, yielding L2_out_3. Finally, L2_out_3 is used to establish a spatio-temporal attention-guiding mask.
The creation of the spatio-temporal attention-guiding mask is divided into three steps. First, an empty mask with the same temporal and spatial dimensions as Full_C1_out is created. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the local image data were acquired. Then, the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame based on the relative positions of the local regions in the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, the attention-guiding mask Mask_attention, with the same shape as the feature map of the target full-frame stream, is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
S4, after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
In particular, the network structure in this embodiment may be implemented by using other network structures capable of achieving the same technical effects, for example, the global feature extraction network ResNeXt in fig. 2 and 3 may be any form of feature extraction network, and the Local feature extraction ResNeXt-Local network may be any form of feature extraction network that performs splitting-feature extraction-merging operations on input features.
As shown in fig. 5, a dual-stream global-local motion recognition system based on video input includes a human body key point feature extraction module, a data preprocessing module and a recognition module;
the human body key point feature extraction module extracts human body key points in the video by using a human body key point identification method, selects the extracted human body key points, and cuts out a plurality of human body parts according to the key point positions to be used as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the input global and local by using a cutting and overturning mode which is conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module, wherein the local feature extraction module is constructed by using the local bottleneck structure provided by the invention as a basic module, and processes a plurality of local inputs obtained by the data preprocessing module, and finally gathers to obtain a recognition result; the global feature extraction module processes global features to obtain an identification result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; the action recognition is performed by the method.
It should be noted that the system above only illustrates the division into functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to perform all or part of the functions described above. The system is applied to the dual-stream global-local action recognition method of the above embodiment.
A dual stream global-local motion recognition device based on video input, comprising: a memory for storing a computer program;
a processor for implementing the dual-stream global-local action recognition method based on video multi-local extraction of any one of claims 1 to 5 when executing said computer program.
As shown in fig. 6, a storage medium stores a program that, when executed by a processor, implements the dual-stream global-local action recognition method of the above embodiment, specifically:
extracting the human key points in the video using a human key point recognition method, selecting the extracted key points, and cropping according to the key point positions to obtain a plurality of human local regions as the input of the local feature extraction module;
performing data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the method described above.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (8)

1. A dual-stream global-local action recognition method based on video input, characterized in that the method specifically comprises the following steps:
step 1, acquiring a video input containing a single recognition object's action, extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping according to the key point positions to obtain a plurality of recognition object local images; performing data preprocessing on the global video input and on a plurality of local video inputs formed from the recognition object local images;
step 2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
step 3, adding a local feature enhancement module, processing an intermediate output feature map containing local information in a local network to obtain an attention guiding mask, and enhancing an intermediate result of the global network in a space dimension and a time dimension by using the attention guiding mask;
and 4, cooperatively training the double-flow network and the enhancement module to obtain a global-local action recognition model, and performing action recognition.
2. The dual stream global-local motion recognition method based on video input of claim 1, wherein the specific method of step 1 is as follows:
1.1) Using a key point recognition network, a plurality of key points of the recognition object are recognized from the acquired video of the single recognition object's action, and local images centered on the key points are cropped out according to the recognized recognition object key points;
1.2) The recognition object key point data are denoted J_i = (x_i, y_i), i = 1...clip_size, where J_i represents the position of a key point of the recognition object in one image. The local image centered on a key point is defined as a square crop_box with side length box_len, so that each frame yields clip_size local images of shape box_len × box_len centered on the key points J_i:
crop_box = ((x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2))
From each frame, G partial images I_local can be obtained:
I_local = P(crop(I_k, crop_box))
where the crop function cuts out of the input image I_k the region covered by crop_box, and the function P(·) denotes the data preprocessing operation; the local video input Input_local is composed of these local images, one sequence of preprocessed local images per key-point region, across all sampled frames.
3. The dual stream global-local motion recognition method based on video input of claim 1, wherein the specific method of step 2 is as follows:
2.1 The global feature extraction network extracts action features in the global video to obtain a prediction result;
2.2 A local feature extraction network extracts action features in the local video;
2.2.1) in the preprocessed local video frames, video sequences are divided with the key points of G different parts of the recognition object as centers;
2.2.2) the result of step 2.2.1) is used as input data for local-network feature extraction: the input data are split into G groups of local data, network features are extracted for each group separately, and the groups are then merged; that is, the input data are treated as a combination of G groups of local data, and the local modules process the G groups respectively;
2.2.3) after the network feature extraction of step 2.2.2) is finished, the data are again divided into G groups, a prediction result is obtained for each group, and the results are averaged to obtain the final prediction result.
4. The method for identifying double-flow global-local actions based on video input according to claim 1, wherein the specific method in step 3 is as follows:
3.1) Acquire local average features. Let the intermediate feature map output by the global network be Full_C1_out and the intermediate feature map output by the local feature extraction network be L2_out. First, the output feature map of the local feature extraction network is processed to obtain local average features: the features are divided into clip_size groups in the filter dimension, and the feature maps within each group are averaged to obtain a local average feature map.
3.2) Time-series alignment. Because down-sampling is performed during feature extraction, the feature L2_out is compressed to T_2 in the time dimension and the feature Full_L1_out is compressed to T_1 in the time dimension. The frame_size dimension is replicated in the time dimension to obtain L2_out_3, so that the time dimension of the generated attention-guiding mask is consistent with the target feature map of the guided full-frame stream and the clip_size local average features are aligned in time sequence.
3.3) Create an attention-guiding mask: a spatio-temporal attention-guiding mask is established using L2_out_3. First, an empty mask Mask_empty with the same temporal and spatial dimensions as Full_C1_out is created. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the image data of the local video sequence were cut out. Then, the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame according to the relative positions of the local regions within the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, an attention-guiding mask Mask_attention with the same shape as the feature map of the target full-frame stream is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
5. The method for identifying dual-stream global-local actions based on video input according to claim 1, wherein said step 4 is specifically as follows:
after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
6. A dual-stream global-local action recognition system based on video input, characterized by comprising a recognition object key point feature extraction module, a data preprocessing module and a recognition module;
the recognition object key point feature extraction module is used for extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping out a plurality of recognition object local regions according to the key point positions as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module; the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the above method.
7. A dual stream global-local motion recognition device based on video input, comprising:
a memory for storing a computer program;
a processor for implementing the dual-stream global-local action recognition method based on video multi-local extraction of any one of claims 1 to 5 when executing said computer program.
8. A computer readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the multi-local extraction dual stream global-local action recognition method of any one of claims 1 to 5.
CN202310070774.0A 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input Pending CN116311495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310070774.0A CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310070774.0A CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Publications (1)

Publication Number Publication Date
CN116311495A true CN116311495A (en) 2023-06-23

Family

ID=86836836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310070774.0A Pending CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Country Status (1)

Country Link
CN (1) CN116311495A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078817A * 2023-08-23 2023-11-17 Beijing Baidu Netcom Science and Technology Co., Ltd. Video generation method, device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN112131908B (en) Action recognition method, device, storage medium and equipment based on double-flow network
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
Li et al. GaitSlice: A gait recognition model based on spatio-temporal slice features
CN111639571B (en) Video action recognition method based on contour convolution neural network
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN115457624B (en) Face recognition method, device, equipment and medium for wearing mask by cross fusion of local face features and whole face features
CN113128368B (en) Method, device and system for detecting character interaction relationship
CN111080670A (en) Image extraction method, device, equipment and storage medium
CN116311495A (en) Dual-stream global-local action recognition method, system, equipment and storage medium based on video input
CN112749671A (en) Human behavior recognition method based on video
Ghaffar Facial emotions recognition using convolutional neural net
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Agrawal et al. Multimodal vision transformers with forced attention for behavior analysis
Salem et al. A novel face inpainting approach based on guided deep learning
CN117409476A (en) Gait recognition method based on event camera
Ni et al. Diverse local facial behaviors learning from enhanced expression flow for microexpression recognition
Yao et al. FoSp: Focus and separation network for early smoke segmentation
CN110674675A (en) Pedestrian face anti-fraud method
CN116129051A (en) Three-dimensional human body posture estimation method and system based on graph and attention interleaving
CN115909497A (en) Human body posture recognition method and device
CN114724058A (en) Method for extracting key frames of fusion characteristic motion video based on human body posture recognition
Lin et al. Face detection algorithm based on multi-orientation gabor filters and feature fusion
Zhu et al. Deepfake detection via inter-frame inconsistency recomposition and enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination