CN116311495A - Dual-stream global-local action recognition method, system, equipment and storage medium based on video input - Google Patents

Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Info

Publication number
CN116311495A
CN116311495A (application CN202310070774.0A)
Authority
CN
China
Prior art keywords
local
global
video
recognition
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310070774.0A
Other languages
Chinese (zh)
Inventor
苗启广
梁思宇
李宇楠
陈绘州
史媛媛
刘如意
盛立杰
刘向增
谢琨
卢子祥
宋建锋
刘林润佳
权义宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310070774.0A
Publication of CN116311495A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A dual-stream global-local action recognition method, system, device and storage medium based on video input. The method comprises the following steps: collecting a video input containing the behavior of a single recognition object, extracting feature key points of the recognition object with a key point recognition method, selecting key points, and cropping the behavior video frame by frame to obtain local images of the corresponding regions; performing data preprocessing on the global video input and on a plurality of local video inputs composed of the local images of the recognition object; training a local feature extraction module and a global feature extraction module by feeding the local videos and the original video into the network respectively; adding a local feature enhancement module and a result fusion structure and training them cooperatively to obtain a global-local action recognition model; and performing action recognition. The system, device and medium implement the dual-stream global-local action recognition method with multi-local extraction. The method is simple to operate and improves the action recognition prediction results of the overall dual-stream global-local action recognition method.

Description

Dual-stream global-local action recognition method, system, equipment and storage medium based on video input
Technical Field
The invention belongs to the technical field of video processing and understanding, and particularly relates to a double-flow global-local action recognition method, system, equipment and storage medium based on video input.
Background
Human behavior recognition technology mainly comprises three aspects: human target recognition, human tracking, and behavior recognition, of which behavior recognition is the higher-level computer vision task built on the first two. Research into robust behavior recognition algorithms has important theoretical significance and broad application prospects, including intelligent video surveillance and video retrieval. To reduce interference from redundant background information and learn the dynamic information of the human body in a video, many methods perform recognition by fusing information from multiple modalities. Global-local methods crop out human body parts in the video for behavior recognition. StNet concatenates N consecutive frames in the RGB channel dimension as a global representation of the video, called a super-image; it obtains local spatio-temporal features from the super-image, then combines these local spatio-temporal features and performs feature extraction in the time dimension to obtain global spatio-temporal features. Attention-based methods use an attention mechanism to emphasize local information in the video as a branch and fuse it with the global network at the softmax layer to combine global and local features.
Patent application CN113761992A discloses a video action recognition method comprising: acquiring a video; processing the video through the hidden layers of a neural network model to obtain the recognition objects in the video and the motions corresponding to them, where the hidden layers comprise a plurality of processing units; and outputting the action recognition result of the video based on the recognition objects and their corresponding motions. At least one of the processing units sequentially extracts the spatial and temporal features of the input video, combines them, performs point-wise convolution, and outputs the spatial and temporal semantic information of the video. According to that technical scheme, the video processing has stronger spatio-temporal relation encoding capability, can extract more meaningful features with fewer parameters, and can learn more useful information from the dataset with a more compact structure; when implementing video processing, only a single processor is required to process the amount of video previously handled by multiple processors.
Existing behavior recognition methods process a single video input and obtain a prediction from the extracted temporal and spatial features, without deeply mining the more detailed spatial feature information in the video images. In addition, some methods use multi-modal fusion, characterizing global features with dual-stream or even multi-stream models of identical structure processed in parallel, and neglect a large amount of fine-grained local information. Methods that use local cropping to emphasize local regions of the recognition object obtain dynamic information, but ignore global information and some useful fine-grained information. Global-local methods based on image stitching and attention mechanisms emphasize local information, but the local fine-grained information has already been lost during preprocessing, before feature extraction.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a dual-stream global-local action recognition method, system, device and storage medium based on video input, which learn the global information of object actions in the video while attending to local fine-grained feature information: local detail information of the recognition object is obtained from the local video inputs produced by the local cropping operation, and combining different local features yields more of the video's dynamic local fine-grained information and thus better global features.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a double-flow global-local action recognition method based on video input specifically comprises the following steps:
step 1, acquiring video input with single recognition object action, extracting recognition object key points in the video by using a recognition object key point recognition method, selecting the extracted recognition object key points, and cutting according to the key point positions to obtain a plurality of recognition object partial images; performing data preprocessing on a plurality of local video inputs and global video inputs formed by a plurality of identification object local images;
step 2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
step 3, adding a local feature enhancement module, processing an intermediate output feature map containing local information in a local network to obtain an attention guiding mask, and enhancing an intermediate result of the global network in a space dimension and a time dimension by using the attention guiding mask;
and 4, cooperatively training the double-flow network and the enhancement module to obtain a global-local action recognition model, and performing action recognition.
The specific method of the step 1 is as follows:
1.1) Using a key point recognition network, a plurality of key points of the recognition object are recognized from the acquired video of the single recognition object's action, and local images centered on the key points are cropped out according to the recognized recognition object key points;
1.2) The recognition object key point data are denoted J_i = (x_i, y_i), i = 1...clip_size, where J_i represents the position of a key point of the recognition object in one image. The local image centered on a key point is defined as a square crop_box with side length box_len, so that each frame yields clip_size local images of shape box_len × box_len centered on the key points J_i:
crop_box = ((x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2))
From each frame, G partial images I_local can be obtained:
I_local = P(crop(I_k, crop_box))
where the crop function cuts out of the input image I_k the region covered by crop_box, and the function P(·) denotes the data preprocessing operation. The local video input Input_local is composed of these local images: one sequence of preprocessed local images per key-point region, across all sampled frames.
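For illustration only, the per-frame cropping of step 1.2 can be sketched as follows; this is a minimal sketch under stated assumptions (the function name extract_local_images and the NumPy array layout are illustrative, not the patent's implementation):
```python
# Illustrative sketch of step 1.2: crop one square local image per key point from a frame.
# extract_local_images and the array layout are assumed names/conventions, not the patent's code.
import numpy as np

def extract_local_images(frame: np.ndarray, keypoints: np.ndarray, box_len: int):
    """frame: (H, W, 3) image; keypoints: (G, 2) array of (x_i, y_i) key-point positions."""
    h, w = frame.shape[:2]
    half = box_len // 2
    crops = []
    for x, y in keypoints.astype(int):
        # crop_box = (x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2)
        x0, y0 = max(x - half, 0), max(y - half, 0)
        x1, y1 = min(x + half, w), min(y + half, h)
        crops.append(frame[y0:y1, x0:x1])   # later resized/normalized by the preprocessing P(.)
    return crops                            # G local images for this frame
```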
The specific method of the step 2 is as follows:
2.1 The global feature extraction network extracts action features in the global video to obtain a prediction result;
2.2 A local feature extraction network extracts action features in the local video;
2.2.1) In the preprocessed local video frames, video sequences are divided with the key points of G different parts of the recognition object as centers;
2.2.2) The result of step 2.2.1) is used as input data for local-network feature extraction: the input data are split into G groups of local data, network features are extracted for each group separately, and the groups are then merged; that is, the input data are treated as a combination of G groups of local data, and the local modules process the G groups respectively;
2.2.3) After the network feature extraction of step 2.2.2) is finished, the data are again divided into G groups, a prediction result is obtained for each group, and the results are averaged to obtain the final prediction result.
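The split-predict-average behaviour of steps 2.2.2) and 2.2.3) can be sketched roughly as below; the class name LocalHead and the feature/class dimensions (2048 features, 400 classes) are illustrative assumptions, not the patent's exact layers:
```python
# Illustrative sketch of steps 2.2.2)-2.2.3): split pooled local features into G groups,
# predict per group, then average. Dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class LocalHead(nn.Module):
    def __init__(self, feat_dim: int = 2048, groups: int = 8, num_classes: int = 400):
        super().__init__()
        self.groups = groups
        self.fc = nn.Linear(feat_dim // groups, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) pooled output of the local backbone
        b = feat.size(0)
        per_group = feat.view(b, self.groups, -1)   # G groups of local features
        logits = self.fc(per_group)                 # one prediction per local region
        return logits.mean(dim=1)                   # average of the G predictions
```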
The specific method of the step 3 is as follows:
3.1) Acquire local average features. Let the intermediate feature map output by the global network be Full_C1_out and the intermediate feature map output by the local feature extraction network be L2_out. First, the output feature map of the local feature extraction network is processed to obtain local average features: the features are divided into clip_size groups in the filter dimension, and the feature maps within each group are averaged to obtain a local average feature map.
3.2) Time-series alignment. Because down-sampling is performed during feature extraction, the feature L2_out is compressed to T_2 in the time dimension and the feature Full_L1_out is compressed to T_1 in the time dimension. The frame_size dimension is replicated in the time dimension to obtain L2_out_3, so that the time dimension of the generated attention-guiding mask is consistent with the target feature map of the guided full-frame stream and the clip_size local average features are aligned in time sequence.
3.3) Create an attention-guiding mask: a spatio-temporal attention-guiding mask is established using L2_out_3. First, an empty mask Mask_empty with the same temporal and spatial dimensions as Full_C1_out is established. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the image data of the local video sequence were cut out. Then the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame according to the relative positions of the local regions within the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, an attention-guiding mask Mask_attention with the same shape as the feature map of the target full-frame stream is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
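A rough sketch of the three sub-steps above is given below; build_guide_mask, its argument layout and the nearest/bilinear resampling choices are illustrative assumptions used only to make the idea concrete, not the patent's exact procedure:
```python
# Illustrative sketch of step 3: group-average the local feature maps, align them in time
# with the guided global feature map, and paste each local average at its crop position.
import torch
import torch.nn.functional as F

def build_guide_mask(l2_out, boxes, global_shape, groups=8):
    """
    l2_out:       (B, C, T2, 16, 16) intermediate local-stream feature maps
    boxes:        (T1, groups, 4) per-frame crop boxes (x0, y0, x1, y1) in the global map's scale
    global_shape: (T1, H1, W1) time/space shape of the guided global feature map
    """
    b, c, t2, h, w = l2_out.shape
    t1, h1, w1 = global_shape
    # 1) local average features: one map per group, averaged over the group's filters
    avg = l2_out.view(b, groups, c // groups, t2, h, w).mean(dim=2)        # (B, G, T2, 16, 16)
    # 2) time-series alignment: bring T2 up to the guided stream's T1
    avg = F.interpolate(avg, size=(t1, h, w), mode="nearest")              # (B, G, T1, 16, 16)
    # 3) paste each group's average at its (frame-dependent) crop position
    mask = torch.zeros(b, 1, t1, h1, w1)
    for t in range(t1):
        for g in range(groups):
            x0, y0, x1, y1 = [int(v) for v in boxes[t][g]]
            patch = F.interpolate(avg[:, g, t][:, None], size=(y1 - y0, x1 - x0),
                                  mode="bilinear", align_corners=False)
            mask[:, :, t, y0:y1, x0:x1] = patch
    return mask   # attention-guiding mask, broadcastable onto the global feature map
```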
The specific method of the step 4 is as follows:
after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
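As a rough illustration of step 4, the guided feature and one possible co-training objective are sketched below; the summed cross-entropy loss is an assumption made for illustration, since the exact training loss is not specified here:
```python
# Illustrative sketch of step 4: apply the attention-guiding mask to the global stream and
# train both streams jointly. The loss below is an assumed example, not the patent's loss.
import torch
import torch.nn.functional as F

def guided_global_features(full_c1_out: torch.Tensor, mask_attention: torch.Tensor) -> torch.Tensor:
    # element-wise multiplication: Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
    return full_c1_out * mask_attention

def cotraining_loss(global_logits, local_logits, labels):
    # cooperative training of the global and local feature extraction modules (assumed form)
    return F.cross_entropy(global_logits, labels) + F.cross_entropy(local_logits, labels)
```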
A dual-stream global-local action recognition system based on video input comprises a recognition object key point feature extraction module, a data preprocessing module and a recognition module;
the recognition object key point feature extraction module is used for extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping out a plurality of recognition object local regions according to the key point positions as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module; the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the method described above.
A dual stream global-local motion recognition device based on video input, comprising:
a memory for storing a computer program;
and the processor is used for realizing the double-flow global-local action recognition method based on the video multi-local extraction in the steps 1 to 4 when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the multi-local extraction dual stream global-local action recognition method of steps 1-4.
Compared with the prior art, the invention has the following advantages and beneficial effects:
firstly, the method and the device can automatically position a plurality of selected key parts of the identification object in each frame of image in the input video by combining the local interception operation, do not need to manually carry out additional identification object key part labeling work, reduce the requirement on input data, and only need to input the original video. The local video image of the identification object obtained by using the local interception operation can better represent the local details of the identification object compared with the video image fed into the global network.
Secondly, in the dual-stream network, the local feature extraction network processes the several preprocessed local input video images of the recognition object, which reduces the time and resource consumption of feature extraction in video recognition; the local video images obtained by the local cropping operation are predicted after feature extraction, which makes use of more local detail visual information and yields better prediction results;
thirdly, the local feature enhancement module added in the global-local double-flow network of the invention gathers the features of a plurality of key parts of the identification object in the local feature extraction module in the space-time field again to obtain the attention guiding mask. By adding the module, the capacity of the global feature extraction module can be improved, so that the effect of the action recognition prediction result obtained by the overall double-flow global-local action recognition method is improved.
Drawings
Fig. 1 is a flowchart of a dual-stream global-local motion recognition method based on video multi-local extraction according to an embodiment of the invention.
Fig. 2 is a comparison of a global feature extraction network and a local feature extraction network model according to an embodiment of the present invention.
Fig. 3 is a comparison of bottleneck structures used by the global feature extraction network and the local feature extraction network model according to an embodiment of the present invention, fig. 3 (a) is a global ResNeXt bottleneck structure, and fig. 3 (b) is a local bottleneck structure.
Fig. 4 is a flowchart of a method for identifying a dual-flow global local action based on video multi-local extraction according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a system according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
Examples:
as shown in fig. 1, taking a human action behavior as an example, the present embodiment provides a dual-stream global-local action recognition method based on video input, including the following steps:
s1, acquiring video input with single human body action content, extracting human body key points in the video by using a human body key point identification method, selecting the extracted human body key points, and cutting according to the key point positions to obtain a plurality of human body partial images; performing data preprocessing on a plurality of local video inputs and global video inputs formed by a plurality of human body local images;
more specifically:
extracting frame_size frames from the acquired video of single human body action as identification objects, and extracting 7 human body key points of a person in the video by using an HRNet network as a key point identifier: nose tip, left and right shoulders, left and right elbows, and the distal ends of the middle fingers of the left and right hands. The size of the side length box_len of the clipping box is set to be the average value of 1.5 times of the distance from the nose tip of the human body to the left shoulder in all frames of the video, and a square image with the side length box_len is obtained by processing the square image with the key point as the center. In order to meet the setting of 8 partial images input by the partial feature extraction module, the facial image with the tip of the nose as the center is duplicated. And finally obtaining the position information of 8 human body partial images to be intercepted.
And identifying a plurality of human body key points from the video of the single human body action obtained from the acquisition input by using a key point identification network, and cutting out a local image taking the key points as the center according to the identified human body key points. Human body key point data is set as J i =(x i ,y i ),i=1...clip size 。J i Representing the location of a human keypoint in an image. Setting a local image with a key point as a center as a box with a side length len Square crop of (2) box Obtaining clip of each frame size Based on the key points J of human body i The shape of the center is a partial image of the center of the key point of box_len×box_len.
crop box =(x i -box len /2,y i -box len /2),(x i +box len /2,y i +box len /2)
From each frame, G partial images i_local can be obtained:
I_local=P(crop(I k ,crop_box)),I local ∈R 3×128×128
the crop function represents a portion where the crop_box is located, and the function P (·) represents a data preprocessing operation, specifically, adjusting the shape size of the input image to 3×box_len×box_len. The definition of the local video Input input_local formalization of the local image composition is as follows:
Figure BDA0004064746450000101
inputted into
Figure BDA0004064746450000102
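The crop-box sizing just described (box_len equal to 1.5 times the average nose-tip-to-left-shoulder distance, with the nose-tip crop duplicated so that 7 key points yield 8 local images) can be sketched as follows; the function names and argument layout are illustrative assumptions:
```python
# Illustrative sketch of the embodiment's crop-box sizing and 8-local selection.
# compute_box_len / eight_crop_centers are assumed helper names, not the patent's code.
import numpy as np

def compute_box_len(nose_xy: np.ndarray, left_shoulder_xy: np.ndarray) -> float:
    """nose_xy, left_shoulder_xy: (frame_size, 2) per-frame key-point coordinates."""
    dist = np.linalg.norm(nose_xy - left_shoulder_xy, axis=1)
    return 1.5 * float(dist.mean())          # e.g. 1.5 * 107 px ≈ 160 px in the fig. 4 example

def eight_crop_centers(keypoints_7: np.ndarray) -> np.ndarray:
    """keypoints_7: (7, 2) = nose tip, shoulders, elbows, middle-finger tips -> (8, 2) centers."""
    return np.concatenate([keypoints_7[:1], keypoints_7], axis=0)   # duplicate the face crop
```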
As shown in fig. 4, take a frame of shape 512×512 from the input video as an example, where the average nose-tip-to-left-shoulder distance in the video is 107 pixels; a 160×160 region representing the person's facial video information is cropped from the frame, and the local regions corresponding to the other 7 predefined human key points are obtained in the same way. Because the parameter count of the model must be controlled, the input to the ResNeXt network is preprocessed into picture sets of shape 128×128. A simple calculation shows that, in the global feature extraction stream, only about 25% of the frame's local resolution is retained after data preprocessing (128/512), whereas the local feature extraction network learns picture data that retain about 80% of the local detail information (128/160). The difference in the amount of information contained in the full-frame input and the local input can also be seen intuitively in fig. 4.
S2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
more specifically:
the global feature extraction network uses a ResNeXt network to extract structures in the global video that include one convolutional layer and four bottleneck blocks containing residual structures, as shown in fig. 3.
The local feature extraction network takes as input video sequences segmented from the original video data, each group of data being a video sequence centered on a human key point. The local feature extraction network is designed as shown in fig. 3: the Conv1 layer of ResNeXt is set as a group convolution layer with group number G, and the 1×1 and 3×3 convolutions in the bottleneck structure are likewise set as group convolution layers with group number G. As can be seen from fig. 3, this design ensures that the model learns the different groups of local features separately during training. The result of Layer4 passes through an AdaptiveAvgPool3d layer to obtain a 2048-dimensional prediction vector, which represents the recognition results of the clip_size local regions; it is split into clip_size × (2048/clip_size), fed into the fc layer to obtain the prediction result of each region, and the clip_size group results are finally averaged. Specific parameter information of the global and local feature extraction networks is shown in fig. 2.
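To make the grouped design concrete, a hedged sketch of a local bottleneck block in the spirit of fig. 3(b) is given below; LocalBottleneck3D, its channel sizes and normalization layers are illustrative assumptions rather than the patent's exact layers (channel counts must be divisible by the group number):
```python
# Illustrative sketch of a grouped ("local") bottleneck: every convolution uses groups=G,
# so the G local streams are processed separately and never mix inside the block.
import torch.nn as nn

class LocalBottleneck3D(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, groups: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, kernel_size=1, groups=groups, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, kernel_size=3, padding=1, groups=groups, bias=False),
            nn.BatchNorm3d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=1, groups=groups, bias=False),
            nn.BatchNorm3d(out_ch),
        )
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1, groups=groups, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))   # residual connection kept per group
```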
S3, adding a local feature enhancement module and a result fusion structure to cooperatively train and obtain a global-local action recognition model, and using the model to perform action recognition.
More specifically:
The output of the global network's Conv1 is enhanced in the spatial and temporal dimensions using an intermediate feature map of the local network that contains local information. As shown in the figure, the local feature enhancement module is divided into three parts: 1. acquiring local average features; 2. time-series alignment; 3. establishing an attention-guiding mask.
The Conv1 output of the global network is the feature Full_C1_out, and the layer2 output of the ResNeXt-Local network is the feature L2_out.
First, the features are divided into clip_size groups in the filter dimension, and the 512/clip_size feature maps of each group are averaged to obtain a local average feature map.
Time-series alignment is then required. Because down-sampling is performed during feature extraction, the feature L2_out is compressed from frame_size in the time dimension to T_2, and the feature Full_L1_out is compressed in the time dimension to T_1. To make the time dimension of the generated attention-guiding mask coincide with the target feature map of the guided full-frame stream, the frame_size dimension is replicated in the time dimension, yielding L2_out_3. Finally, L2_out_3 is used to establish a spatio-temporal attention-guiding mask.
The creation of the spatio-temporal attention-guiding mask is divided into three steps. First, an empty mask with the same temporal and spatial dimensions as Full_C1_out is created. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the local image data were acquired. Then, the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame based on the relative positions of the local regions in the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, the attention-guiding mask Mask_attention, with the same shape as the feature map of the target full-frame stream, is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
S4, after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
In particular, the network structure in this embodiment may be implemented by using other network structures capable of achieving the same technical effects, for example, the global feature extraction network ResNeXt in fig. 2 and 3 may be any form of feature extraction network, and the Local feature extraction ResNeXt-Local network may be any form of feature extraction network that performs splitting-feature extraction-merging operations on input features.
As shown in fig. 5, a dual-stream global-local motion recognition system based on video input includes a human body key point feature extraction module, a data preprocessing module and a recognition module;
the human body key point feature extraction module extracts human body key points in the video by using a human body key point identification method, selects the extracted human body key points, and cuts out a plurality of human body parts according to the key point positions to be used as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the input global and local by using a cutting and overturning mode which is conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module, wherein the local feature extraction module is constructed by using the local bottleneck structure provided by the invention as a basic module, and processes a plurality of local inputs obtained by the data preprocessing module, and finally gathers to obtain a recognition result; the global feature extraction module processes global features to obtain an identification result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; the action recognition is performed by the method.
It should be noted that the system above only illustrates the division into functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to perform all or part of the functions described above. The system is applied to the dual-stream global-local action recognition method of the above embodiment.
A dual stream global-local motion recognition device based on video input, comprising: a memory for storing a computer program;
a processor for implementing the dual-stream global-local action recognition method based on video multi-local extraction of any one of claims 1 to 5 when executing said computer program.
As shown in fig. 6, a storage medium stores a program that, when executed by a processor, implements the dual-stream global-local action recognition method of the above embodiment, specifically:
extracting the human key points in the video using a human key point recognition method, selecting the extracted key points, and cropping according to the key point positions to obtain a plurality of human local regions as the input of the local feature extraction module;
performing data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the method described above.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (8)

1. A dual-stream global-local action recognition method based on video input, characterized in that the method specifically comprises the following steps:
step 1, acquiring a video input containing a single recognition object's action, extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping according to the key point positions to obtain a plurality of recognition object local images; performing data preprocessing on the global video input and on a plurality of local video inputs formed from the recognition object local images;
step 2, extracting action characteristics in global and local videos by adopting a double-flow global-local action recognition network; the local feature extraction network and the global feature extraction network are used for carrying out feature extraction operation on the input local video and original video data respectively;
step 3, adding a local feature enhancement module, processing an intermediate output feature map containing local information in a local network to obtain an attention guiding mask, and enhancing an intermediate result of the global network in a space dimension and a time dimension by using the attention guiding mask;
and 4, cooperatively training the double-flow network and the enhancement module to obtain a global-local action recognition model, and performing action recognition.
2. The dual stream global-local motion recognition method based on video input of claim 1, wherein the specific method of step 1 is as follows:
1.1) Using a key point recognition network, a plurality of key points of the recognition object are recognized from the acquired video of the single recognition object's action, and local images centered on the key points are cropped out according to the recognized recognition object key points;
1.2) The recognition object key point data are denoted J_i = (x_i, y_i), i = 1...clip_size, where J_i represents the position of a key point of the recognition object in one image. The local image centered on a key point is defined as a square crop_box with side length box_len, so that each frame yields clip_size local images of shape box_len × box_len centered on the key points J_i:
crop_box = ((x_i - box_len/2, y_i - box_len/2), (x_i + box_len/2, y_i + box_len/2))
From each frame, G partial images I_local can be obtained:
I_local = P(crop(I_k, crop_box))
where the crop function cuts out of the input image I_k the region covered by crop_box, and the function P(·) denotes the data preprocessing operation; the local video input Input_local is composed of these local images, one sequence of preprocessed local images per key-point region, across all sampled frames.
3. The dual stream global-local motion recognition method based on video input of claim 1, wherein the specific method of step 2 is as follows:
2.1 The global feature extraction network extracts action features in the global video to obtain a prediction result;
2.2 A local feature extraction network extracts action features in the local video;
2.2.1) in the preprocessed local video frames, video sequences are divided with the key points of G different parts of the recognition object as centers;
2.2.2) the result of step 2.2.1) is used as input data for local-network feature extraction: the input data are split into G groups of local data, network features are extracted for each group separately, and the groups are then merged; that is, the input data are treated as a combination of G groups of local data, and the local modules process the G groups respectively;
2.2.3) after the network feature extraction of step 2.2.2) is finished, the data are again divided into G groups, a prediction result is obtained for each group, and the results are averaged to obtain the final prediction result.
4. The method for identifying double-flow global-local actions based on video input according to claim 1, wherein the specific method in step 3 is as follows:
3.1) Acquire local average features. Let the intermediate feature map output by the global network be Full_C1_out and the intermediate feature map output by the local feature extraction network be L2_out. First, the output feature map of the local feature extraction network is processed to obtain local average features: the features are divided into clip_size groups in the filter dimension, and the feature maps within each group are averaged to obtain a local average feature map.
3.2) Time-series alignment. Because down-sampling is performed during feature extraction, the feature L2_out is compressed to T_2 in the time dimension and the feature Full_L1_out is compressed to T_1 in the time dimension. The frame_size dimension is replicated in the time dimension to obtain L2_out_3, so that the time dimension of the generated attention-guiding mask is consistent with the target feature map of the guided full-frame stream and the clip_size local average features are aligned in time sequence.
3.3) Create an attention-guiding mask: a spatio-temporal attention-guiding mask is established using L2_out_3. First, an empty mask Mask_empty with the same temporal and spatial dimensions as Full_C1_out is created. The shape Shape_trans ∈ R^(h1×w1) to which each local average feature is restored in the guiding mask is computed from the crop box used when the image data of the local video sequence were cut out. Then, the shape of the local average feature is scaled from 16×16 to h1×w1. Finally, in the spatial dimension, the G local feature maps are integrated into the mask of one frame according to the relative positions of the local regions within the frame; in the time dimension, according to the relative positions of the local regions as they change across frames, an attention-guiding mask Mask_attention with the same shape as the feature map of the target full-frame stream is built frame by frame. Mask_attention represents the fusion, along the time dimension, of the local network's understanding of the G groups of local spatial information.
5. The method for identifying dual-stream global-local actions based on video input according to claim 1, wherein said step 4 is specifically as follows:
after the attention-guiding mask is obtained, the mask is multiplied element-wise with the feature map output by Conv1 of the global network to obtain the guided feature Att_Full_C1_out:
Att_Full_C1_out = Mask_attention ⊙ Full_C1_out
After the local feature enhancement module is added, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a final multi-local extraction double-flow global-local action recognition network.
6. A dual-stream global-local action recognition system based on video input, characterized by comprising a recognition object key point feature extraction module, a data preprocessing module and a recognition module;
the recognition object key point feature extraction module is used for extracting the recognition object key points in the video with a recognition object key point recognition method, selecting the extracted key points, and cropping out a plurality of recognition object local regions according to the key point positions as the input of the local feature extraction module;
the data preprocessing module performs data preprocessing on the global and local inputs using cropping and flipping operations that are conventional in the field;
the recognition module comprises a local feature extraction module, a global feature extraction module and a local feature enhancement module; the local feature extraction module is constructed with the local bottleneck structure provided by the invention as its basic module, processes the plurality of local inputs obtained by the data preprocessing module, and finally aggregates the results to obtain a recognition result; the global feature extraction module processes the global features to obtain a recognition result; the local feature enhancement module, the global feature extraction module and the local feature extraction module are trained cooperatively to obtain a global-local action recognition model; and action recognition is performed by the above method.
7. A dual stream global-local motion recognition device based on video input, comprising:
a memory for storing a computer program;
a processor for implementing the dual-stream global-local action recognition method based on video multi-local extraction of any one of claims 1 to 5 when executing said computer program.
8. A computer readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the multi-local extraction dual stream global-local action recognition method of any one of claims 1 to 5.
CN202310070774.0A 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input Pending CN116311495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310070774.0A CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310070774.0A CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Publications (1)

Publication Number Publication Date
CN116311495A true CN116311495A (en) 2023-06-23

Family

ID=86836836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310070774.0A Pending CN116311495A (en) 2023-01-19 2023-01-19 Dual-stream global-local action recognition method, system, equipment and storage medium based on video input

Country Status (1)

Country Link
CN (1) CN116311495A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117078817A * 2023-08-23 2023-11-17 Beijing Baidu Netcom Science and Technology Co., Ltd. Video generation method, device, equipment and medium

Similar Documents

Publication Publication Date Title
WO2023056889A1 (en) Model training and scene recognition method and apparatus, device, and medium
CN112131908B (en) Action recognition method, device, storage medium and equipment based on double-flow network
CN109919830B (en) Method for restoring image with reference eye based on aesthetic evaluation
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
Li et al. GaitSlice: A gait recognition model based on spatio-temporal slice features
CN111639571B (en) Video action recognition method based on contour convolution neural network
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN115457624B (en) Face recognition method, device, equipment and medium for wearing mask by cross fusion of local face features and whole face features
CN113128368B (en) Method, device and system for detecting character interaction relationship
CN111080670A (en) Image extraction method, device, equipment and storage medium
CN116311495A (en) Dual-stream global-local action recognition method, system, equipment and storage medium based on video input
CN112749671A (en) Human behavior recognition method based on video
Ghaffar Facial emotions recognition using convolutional neural net
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Agrawal et al. Multimodal vision transformers with forced attention for behavior analysis
Salem et al. A novel face inpainting approach based on guided deep learning
CN117409476A (en) Gait recognition method based on event camera
Ni et al. Diverse local facial behaviors learning from enhanced expression flow for microexpression recognition
Yao et al. FoSp: Focus and separation network for early smoke segmentation
CN110674675A (en) Pedestrian face anti-fraud method
CN116129051A (en) Three-dimensional human body posture estimation method and system based on graph and attention interleaving
CN115909497A (en) Human body posture recognition method and device
CN114724058A (en) Method for extracting key frames of fusion characteristic motion video based on human body posture recognition
Lin et al. Face detection algorithm based on multi-orientation gabor filters and feature fusion
Zhu et al. Deepfake detection via inter-frame inconsistency recomposition and enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination