CN115836330A - Action identification method based on depth residual error network and related product - Google Patents
- Publication number
- CN115836330A (application CN202180048575.9A)
- Authority
- CN
- China
- Prior art keywords
- feature
- convolution module
- convolution
- motion
- cascade
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Abstract
Provided are a motion recognition method based on a deep residual network and a related product. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using a one-dimensional filter to obtain a motion-related feature and processing the input using a two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking that output as the input of the next convolution module until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module of the at least one convolution module. The model size and the computational cost are thereby reduced.
Description
Technical Field
The present disclosure relates to the field of neural network technologies, and in particular to a motion recognition method based on a deep residual network and a related product.
Background
The resurgence of Convolutional Neural Networks (CNNs) and large-scale labeled datasets has led to unprecedented advances in image classification with end-to-end trainable networks. However, video-based human motion recognition cannot be achieved with CNN features alone. How to effectively model temporal information, i.e., to identify temporal correlations and causal relationships, is a fundamental challenge.
A classical branch of research has focused on modeling motion with hand-crafted optical flow, and the dual-stream approach, which processes the optical flow modality and the Red-Green-Blue (RGB) modality in separate streams, is one of the most successful architectures. However, optical flow computation is costly.
Disclosure of Invention
Embodiments provide a motion recognition method based on a deep residual network and a related product, to reduce the model size of the deep residual network used for video-based human motion recognition and to reduce the computational cost.
In a first aspect, a motion recognition method based on a deep residual network is provided. The method is applied to a deep residual network system comprising at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter. The method includes the following. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module of the at least one convolution module.
In a second aspect, a motion recognition apparatus based on a deep residual network is provided. The apparatus is applied to a deep residual network system comprising at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter. The apparatus includes a receiving unit, a processing unit, and an identification unit. The receiving unit is configured to receive a video segment as input at a first convolution module of the at least one convolution module. The processing unit is configured to traverse the at least one convolution module by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. The identification unit is configured to identify at least one action included in the video segment based on the output of the last convolution module of the at least one convolution module.
In a third aspect, a terminal device is provided, which includes a processor and a memory for storing one or more programs. The one or more programs are configured to be executed by the processor and include instructions for performing some or all of the operations of the method described in the first aspect.
In a fourth aspect, a non-transitory computer-readable storage medium storing a computer program for electronic data exchange is provided. The computer program includes instructions for performing some or all of the operations of the method described in the first aspect.
In a fifth aspect, a computer program product is provided, which includes a non-transitory computer-readable storage medium storing a computer program. The computer program causes a computer to perform some or all of the operations of the method described in the first aspect.
In the embodiments of the present application, a new deep residual network system is provided. The deep residual network system comprises at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, and the motion recognition method based on the deep residual network is applied to this system. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module. A new convolution module is therefore proposed in the present application, which can be regarded as a pseudo-three-dimensional convolution module in which the standard 3D filter of the related art is decoupled into a parallel two-dimensional spatial filter and one-dimensional temporal filter. By using separable two-dimensional and one-dimensional convolutions instead of three-dimensional convolutions, the model size and the computational cost are greatly reduced. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that the appearance-related feature and the motion-related feature can be modeled differently.
Drawings
To more clearly illustrate the technical solutions in the embodiments, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below relate only to some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of an RGB frame example (top) and a residual frame example (bottom).
Fig. 2 is a schematic diagram of an exemplary detailed design of a deep residual network.
Fig. 3 is a schematic flow diagram of a motion recognition method based on a deep residual network according to an embodiment.
Fig. 4 is a schematic flow diagram of a motion recognition method based on a deep residual network according to an embodiment.
Fig. 5 is a schematic diagram of an exemplary detailed design of the proposed convolution module.
Fig. 6 is a schematic configuration diagram of a motion recognition apparatus based on a deep residual network according to an embodiment.
Fig. 7 is a schematic configuration diagram of a terminal device according to an embodiment.
Detailed Description
In order to make the technical solutions of the embodiments better understood by those skilled in the art, the technical solutions in the embodiments will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, and not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present application.
The terms "first", "second", "third", etc. in the description, the claims, and the accompanying drawings of this application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to the steps or elements listed, but may include other steps or elements that are not listed or that are inherent to such a process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The terminal devices involved in the embodiments may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem and having wireless communication functions, as well as various forms of user equipment (UE), mobile stations (MS), mobile terminals, and so on. For convenience of description, the above-mentioned devices are collectively referred to as terminal devices.
To facilitate a better understanding of the embodiments of the present application, the related art referred to in the present application will be briefly described below.
The resurgence of Convolutional Neural Networks (CNNs) and large-scale labeled datasets has led to unprecedented advances in image classification with end-to-end trainable networks. However, video-based human motion recognition cannot be achieved with CNN features alone. How to effectively model temporal information, i.e., to identify temporal correlations and causal relationships, is a fundamental challenge. A classical branch of research has focused on modeling motion with hand-crafted optical flow. In the context of deep learning, the dual-stream approach, which processes the optical flow modality and the RGB modality in separate streams, is one of the most successful architectures. However, this architecture is not fully satisfactory: optical flow computation is costly, and the dual-stream approach generally cannot be learned end-to-end together with the optical flow.
In the present application, it is proposed to use residual frames, i.e., the differences between adjacent RGB frames, together with the RGB modality as an alternative, "lightweight" motion representation for video-based human motion recognition. The reasons why residual frames can be used together with the RGB modality for video-based human motion recognition are as follows. On the one hand, neighboring RGB frames largely share information about static objects and the background, so residual frames mainly retain motion-specific features, as shown in Fig. 1. Fig. 1 shows an example of RGB frames (top) and residual frames (bottom). As can be seen from Fig. 1, RGB frames contain rich appearance information, while residual frames mainly retain significant motion information. On the other hand, the computational cost of residual frames is negligible compared with other motion representations such as optical flow.
Following the recent trend of developing three-dimensional (3D) convolution models for video classification, a new and effective convolution module is provided in the present application, which can be regarded as a pseudo-3D convolution module in which the original 3D convolution is decoupled into a 2D convolution and a 1D convolution. Furthermore, to further enhance the motion features and reduce the computational cost, residual information in the feature space, i.e., residual features representing the differences between temporally adjacent CNN features, can be utilized. In addition, a self-attention mechanism can be used to recalibrate the appearance-related features and the motion-related features according to their importance to the final task, so as to further reduce the model size and the computational cost and to prevent features that are unimportant to the final task from degrading the accuracy of the deep residual network system.
The motion recognition method based on the deep residual network and the related product are computationally efficient and offer improved performance. In particular, the proposed residual frames (or residual features) are a lightweight alternative to other motion representations (e.g., optical flow), and the new convolution module also helps to significantly reduce the computational cost. In addition, experiments demonstrate that the use of residual frames can significantly improve the accuracy of motion recognition, as described in detail below.
To help those skilled in the art better understand the concept of residual features, the concept of residual frames is introduced first.
Suppose a video segment x ∈ R^(T×H×W×C), where T denotes the number of frames, H and W denote the height and width of each frame, and C denotes the number of channels. A residual frame is formed by subtracting a reference frame x(t2) from a target frame x(t1), where the step size between the time stamps t1 and t2 is denoted s. More formally, a residual frame can be defined as:

x_res(t1) = x(t1) − x(t2), where t2 = t1 + s.
Since neighboring video frames are highly similar in their static information, residual frames typically do not contain the background information and object appearance information, but retain the significant motion-related information. Therefore, residual frames can be regarded as a good source for extracting motion features. Furthermore, the computational cost of residual frames is significantly lower than that of other motion representations such as optical flow.
In reality, the actions and activities contained in a video may be complex and may involve different movement speeds or durations. To account for this uncertainty, successive residual frames may be stacked to form a residual clip (also called a residual segment), which may be defined as:

x_res = [x_res(1), x_res(2), …, x_res(T − s)],

i.e., the consecutive residual frames are stacked along the temporal dimension.
the residual clips can capture fast motion on the spatial axis and slow/long duration motion on the temporal axis. Thus, the residual segment, in which motion information of short duration and long duration can be extracted at the same time, is suitable for 3D convolution.
However, since object appearance and the background scene may also provide important cues for recognizing motion, residual frames alone may not be sufficient to solve the human motion recognition problem. For example, applying eye makeup and applying lipstick involve similar movements, but the movements occur at different locations: one around the eyes and the other around the lips. It is therefore necessary to perform motion recognition using both the RGB frames and the residual frames. To this end, a new convolution module is provided, which is a pseudo-3D convolution module capable of processing RGB frames and residual frames simultaneously.
The embodiments of the present application will be described in detail below.
Fig. 2 is a schematic diagram of an exemplary detailed design of a deep residual network. As shown in Fig. 2, the deep residual network may include at least an input layer, at least one convolution layer, a pooling layer, at least one fully-connected layer, and an output layer. The deep residual network performs motion recognition based on a video segment received as input. The deep residual network system in the present application is a system based on the deep residual network shown in Fig. 2.
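The following is a minimal sketch of the pipeline shown in Fig. 2 (input, convolution modules, pooling, fully-connected layer, output), written in PyTorch for illustration only; the class name, the module list, and the layer sizes are assumptions, and the actual network configuration is given by the embodiments described below.

```python
import torch
import torch.nn as nn

class DeepResidualNetwork(nn.Module):
    """Schematic pipeline of Fig. 2: convolution modules, then global
    pooling, then a fully-connected layer producing action scores."""

    def __init__(self, conv_modules: nn.ModuleList, feat_dim: int, num_classes: int):
        super().__init__()
        self.conv_modules = conv_modules        # e.g. res 2 .. res 5
        self.pool = nn.AdaptiveAvgPool3d(1)     # pool over T, H and W
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a video segment of shape (N, C, T, H, W)
        for module in self.conv_modules:        # traverse the modules; each
            x = module(x)                       # output feeds the next module
        x = self.pool(x).flatten(1)
        return self.fc(x)                       # per-class action scores
```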
Fig. 3 is a schematic flow diagram of a motion recognition method based on a deep residual network according to an embodiment. The method is applied to a deep residual network system comprising at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, and each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter. As shown in Fig. 3, the method includes the following.
302, a first convolution module of the at least one convolution module receives a video segment as input.
In particular, a video clip x ∈ R^(T×H×W×C) can be received as the input of the deep residual network system, where T denotes the number of frames, H and W denote the height and width of each frame, and C denotes the number of channels. For an RGB frame the number of channels is 3, the three channels representing red (R), green (G), and blue (B). When there is no other convolution module or layer before the first convolution module of the at least one convolution module, the video segment is passed to and received as input at the first convolution module. When there are other layers before the first convolution module, the video segment may first be processed by those layers, then passed to the first convolution module and received as its input.
The size of a filter may be expressed as T × H × W, where T denotes the temporal dimension and H and W denote the height and width in the spatial dimensions. A 1D filter may be denoted as T × 1 × 1, where T is greater than 1, and a 2D filter may be denoted as 1 × H × W, where at least one of H and W is greater than 1. The 1D filter performs convolution along the temporal dimension, and the 2D filter performs convolution in the spatial dimensions.
304, traversing the at least one convolution module by performing the following operations.
3042, the input is processed using the one-dimensional filter to obtain a motion-related feature and using the two-dimensional filter to obtain an appearance-related feature.
A new convolution module is provided in the present application, which can be regarded as a pseudo-3D convolution module. In this convolution module, the 3D filter of the related art is decoupled into a parallel 2D spatial filter and 1D temporal filter. By using separable 2D and 1D convolutions instead of a 3D convolution, the model size and the computational cost are greatly reduced, which is in line with the recent trend toward efficient 3D networks. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that appearance features and motion features can be modeled differently.
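A minimal sketch of this decoupling, assuming PyTorch's (N, C, T, H, W) layout rather than the T × H × W × C notation used above; the channel counts and tensor sizes are illustrative. The two parallel paths correspond to the 1 × H × W spatial filter and the T × 1 × 1 temporal filter described above.

```python
import torch
import torch.nn as nn

# Parallel pseudo-3D decomposition: a 1x3x3 spatial filter and a 3x1x1
# temporal filter applied to the same input in separate paths.
spatial_conv = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))
temporal_conv = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))

x = torch.rand(2, 64, 8, 56, 56)    # (N, C, T, H, W), sizes are illustrative
f_s = spatial_conv(x)               # appearance-related feature
f_m = temporal_conv(x)              # motion-related feature
```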
3044, shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature.
In particular, when modeling motion, the idea of residual frames is extended from the pixel level to the feature level. Assume that the output feature of the 1D temporal convolution is f_m ∈ R^(T′×H′×W′×C′). This feature is shifted along the time dimension by one step (e.g., a step of 1), and the shifted motion-related feature f_m(t + 1) is then subtracted from the original motion-related feature f_m(t) to generate the residual feature, denoted f_res, which can be defined as:

f_res(t) = f_m(t) − f_m(t + 1).
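A possible implementation of this shift-and-subtract step, assuming the (N, C, T, H, W) layout and zero-padding of the last time step (a boundary choice not specified in the text):

```python
import torch

def residual_feature(f_m: torch.Tensor) -> torch.Tensor:
    """Shift f_m (N, C, T, H, W) by one step along the time axis and
    subtract the shifted copy from the original feature."""
    shifted = torch.zeros_like(f_m)
    shifted[:, :, :-1] = f_m[:, :, 1:]   # shifted(t) = f_m(t + 1)
    # The last time step has no successor and is zero-padded here
    # (an assumed boundary handling).
    return f_m - shifted                 # f_res(t) = f_m(t) - f_m(t + 1)
```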
3046, obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature.
Three features are produced after the pseudo-3D convolution: f_s, f_m, and f_res, where f_s is the output of the 2D convolution and preserves the appearance information, while f_m, the output of the 1D convolution, and f_res preserve the distinctive motion structure.
3048, the output of the convolution module is used as the input of the next (i.e., subsequent) convolution module, until the last convolution module of the at least one convolution module is traversed.
306, identifying at least one action included in the video segment based on the output of the last convolution module of the at least one convolution module.
In the embodiments of the present application, a new deep residual network system is provided. The deep residual network system comprises at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, and the motion recognition method based on the deep residual network is applied to this system. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module. A new convolution module is therefore proposed in the present application, which can be regarded as a pseudo-three-dimensional convolution module in which the standard 3D filter of the related art is decoupled into a parallel two-dimensional spatial filter and one-dimensional temporal filter. By using separable two-dimensional and one-dimensional convolutions instead of three-dimensional convolutions, the model size and the computational cost are greatly reduced. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that the appearance-related feature and the motion-related feature can be modeled differently.
In one embodiment, the output of the convolution module is obtained as follows: a cascade feature is obtained by concatenating the motion-related feature, the residual feature, and the appearance-related feature along the channel dimension; and the cascade feature is determined as the output of the convolution module.
To facilitate efficient fusion of the appearance features and the motion features, the output features of the pseudo-3D convolution may be concatenated along the channel dimension to obtain the cascade feature, which may be defined as:

f = f_m ⊕ f_res ⊕ f_s,

where ⊕ denotes concatenation along the channel dimension.
In one embodiment, the output of the convolution module is obtained as follows: a cascade feature is obtained by concatenating the motion-related feature, the residual feature, and the appearance-related feature along the channel dimension; a channel attention mask is obtained based on the cascade feature; and an attention feature is obtained, as the output of the convolution module, based on the channel attention mask and the cascade feature.
To facilitate efficient fusion of the appearance-related features and the motion-related features, a channel self-attention mechanism may further be applied to recalibrate the output features. Specifically, the output features may be concatenated along the channel dimension to obtain the cascade feature, which may be defined as:

f = f_m ⊕ f_res ⊕ f_s,

where the symbol ⊕ denotes concatenation. Since f_m ∈ R^(T′×H′×W′×C′), f_res ∈ R^(T′×H′×W′×C′), and f_s ∈ R^(T′×H′×W′×C′), the concatenated feature satisfies f ∈ R^(T′×H′×W′×3C′).
A channel attention mask M_att can then be obtained based on the above cascade feature. In one embodiment, each convolution module of the at least one convolution module further includes a fully-connected layer, and the channel attention mask may be obtained based on the cascade feature as follows. Global pooling is performed on the cascade feature to obtain a pooled cascade feature, which may be denoted as pool(f). The pooled cascade feature is multiplied by the weight matrix of the fully-connected layer to obtain a weighted cascade feature, which may be denoted as W·pool(f). The weighted cascade feature is added to a bias to obtain a biased cascade feature, which may be denoted as W·pool(f) + b. The channel attention mask is obtained by processing the biased cascade feature using a Sigmoid function, and may be expressed as σ(W·pool(f) + b). Therefore, the channel attention mask M_att can be expressed as:

M_att = σ(W·pool(f) + b),

where W denotes a weight matrix parameterized by a single-layer neural network (i.e., the fully-connected layer of the convolution module described above), b denotes the bias term, pool is a global pooling operation that averages the cascade feature f over space and time, and σ denotes the Sigmoid function. Through the channel attention mask, a dynamic feature is obtained that is conditioned on the input features, with the channels re-weighted according to the importance of the input features to the final task.
After the channel attention mask is obtained, the attention feature may further be obtained based on the channel attention mask and the cascade feature. In one embodiment, the attention feature is obtained by performing a channel-by-channel multiplication between the channel attention mask M_att and the cascade feature f. In another embodiment, to further improve robustness, the attention feature is obtained as follows: an intermediate feature is obtained by performing a channel-by-channel multiplication between the channel attention mask M_att and the cascade feature f, and the intermediate feature is then added to the cascade feature to obtain the attention feature, which is defined as:

f_att = f ⊙ M_att + f,

where the symbol ⊙ denotes channel-by-channel multiplication. A residual connection is thus realized in the proposed convolution module.
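A compact sketch of this channel self-attention, assuming the (N, C, T, H, W) layout; the class name is an illustrative assumption, and the single fully-connected layer plays the role of W and b above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel self-attention over the cascade feature f:
    M_att = sigmoid(W * pool(f) + b),  f_att = f (x) M_att + f."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)      # average over T, H and W
        self.fc = nn.Linear(channels, channels)  # single-layer W and bias b

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c = f.shape[:2]
        m_att = torch.sigmoid(self.fc(self.pool(f).view(n, c)))
        m_att = m_att.view(n, c, 1, 1, 1)        # broadcast per channel
        return f * m_att + f                     # channel-wise mask plus residual connection
```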
In one embodiment, each of the at least one convolution module further comprises a second convolution layer preceding the at least one first convolution layer, the second convolution layer comprising a three-dimensional filter with a size of 1 × 1 × 1, and traversing the at least one convolution module further comprises: processing the input using the three-dimensional filter to reduce the dimensionality of the input, before processing the input using the one-dimensional filter to obtain the motion-related feature and processing the input using the two-dimensional filter to obtain the appearance-related feature. In this case, processing the input using the one-dimensional filter to obtain the motion-related feature and processing the input using the two-dimensional filter to obtain the appearance-related feature comprises: processing the dimensionality-reduced input using the one-dimensional filter to obtain the motion-related feature, and processing the dimensionality-reduced input using the two-dimensional filter to obtain the appearance-related feature.
Specifically, the output of the previous convolution module may be processed by a 3D filter of size 1 × 1 × 1 before being processed by the 1D filter and the 2D filter of the next convolution module, so that the number of channels of the output of the previous convolution module is reduced, thereby reducing the dimensionality of the output of the previous convolution module (i.e., the input of the next convolution module). The second convolution layer may include at least one such 3D filter, and the number of 3D filters may be chosen according to the desired dimensionality. For example, to reduce the dimensionality, the number of 3D filters is smaller than the number of channels of the output of the previous convolution module; to restore the dimensionality, the number of 3D filters is equal to the number of channels of the output of the previous convolution module.
In one embodiment, each of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter with a size of 1 × 1 × 1, and obtaining the output of the convolution module further includes: processing the cascade feature using the three-dimensional filter to increase the dimensionality of the cascade feature; and taking the cascade feature with increased dimensionality as the output of the convolution module.
In one embodiment, each of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter with a size of 1 × 1 × 1, and obtaining the output of the convolution module further includes: processing the attention feature using the three-dimensional filter to increase the dimensionality of the attention feature; and taking the attention feature with increased dimensionality as the output of the convolution module.
Specifically, the first 1 × 1 × 1 convolution and the last 1 × 1 × 1 convolution are used to reduce and restore the dimensionality, respectively.
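Putting the pieces together, the following is a hypothetical end-to-end sketch of one such convolution module (1 × 1 × 1 reduction, parallel 2D/1D convolutions, residual feature, channel-wise concatenation, channel self-attention, 1 × 1 × 1 restoration). It reuses the ChannelAttention sketch above; the class name and channel counts, as well as the omission of strides, normalization, and activations, are simplifying assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class PseudoConv3dModule(nn.Module):
    """Illustrative pseudo-3D convolution module."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1)        # 1x1x1, lower the dimensionality
        self.spatial = nn.Conv3d(mid_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, mid_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.attention = ChannelAttention(3 * mid_ch)                # sketched above
        self.restore = nn.Conv3d(3 * mid_ch, out_ch, kernel_size=1)  # 1x1x1, restore the dimensionality

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        f_s = self.spatial(x)                      # appearance-related feature
        f_m = self.temporal(x)                     # motion-related feature
        shifted = torch.zeros_like(f_m)
        shifted[:, :, :-1] = f_m[:, :, 1:]
        f_res = f_m - shifted                      # residual feature
        f = torch.cat([f_m, f_res, f_s], dim=1)    # concatenate on the channel dimension
        f_att = self.attention(f)                  # recalibrate the channels
        return self.restore(f_att)
```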
Further experiments were performed in which variants of ResNet-60 were developed by replacing all bottleneck blocks with the proposed convolution module.
For example, the performance of the technical solution proposed in the present application is evaluated on the UCF101 dataset, which consists of 13,320 videos covering 101 action classes. For all experiments, the top-1 and top-5 accuracy on the split-1 validation set are reported.
First, the effectiveness of different data modalities is evaluated by training motion classifiers with RGB frames alone, with residual frames alone, and with the combined input of RGB frames and residual frames, respectively. Second, the effect of the residual frame step size on motion recognition is investigated. Finally, an ablation study is conducted to investigate the effectiveness of the various components of the proposed convolution module.
Performance comparisons of different data modalities.
Table 1 shows the action recognition performance of various combinations of input modalities and network architectures. For the experiments using only RGB frames or only residual frames, only one stream is retained in the data layer (i.e., the first convolution layer), but the number of channels is doubled for a fair comparison. As can be seen from Table 1, using only residual frames is about 3% higher in both top-1 and top-5 accuracy than using only RGB frames, indicating that residual frames do contain significant motion information that is important for motion recognition. When RGB frames and residual frames are used in different streams, the top-1 accuracy is further improved by 2.6% (from 83.0% to 85.6%), indicating that the two data modalities retain complementary information. Notably, using the convolution module provided in the present application not only significantly reduces the number of floating-point operations (FLOPs) (from 163G to 40G), but also provides better performance than using the standard 3D convolution of the related art (85.6% vs. 85.0% top-1 accuracy).
Table 1: Performance comparison of different input modalities and network architectures.
The influence of the step size s.
When generating residual frames, the step size s can be varied to capture motion features at different time scales. However, it is not obvious what the optimal step size for the motion recognition task is. The effect of the step size was therefore investigated, and the results are shown in Table 2. Experiments were performed for three settings, in which the input data are residual frames with step size s = 1, 2, and 4, respectively. As can be seen from Table 2, the classification accuracy decreases as the step size increases. We suspect that motion causes a spatial displacement of the same object between two frames, and that using a large step size may therefore lead to a mismatch between the motion representations.
Table 2: Performance comparison for different residual frame step sizes s.
| Step size | Val top-1 | Val top-5 |
| --- | --- | --- |
| s = 1 | 83.0% | 98.3% |
| s = 2 | 82.7% | 97.1% |
| s = 4 | 80.2% | 96.0% |
Ablation study.
To verify the effectiveness of the different components of the proposed convolution module, an ablation study was performed. Without loss of generality, the model is trained with the combined input of RGB frames and residual frames (s = 1). Table 3 shows a performance comparison for various convolution module settings. As shown in Table 3, removing the self-attention mechanism results in a 1.8% decrease in top-1 accuracy (from 85.6% to 83.8%). Meanwhile, when the residual information in the feature space is ignored, the performance also drops, from 85.6% to 83.5%. If the self-attention mechanism and the residual features are both removed, the top-1 accuracy further decreases to 82.1%. These results demonstrate that the channel self-attention mechanism and the residual features are effective in improving motion recognition performance.
Table 3: performance comparisons corresponding to different convolution module settings
| Method | Val top-1 | Val top-5 |
| --- | --- | --- |
| Convolution module without self-attention and residual features | 82.1% | 97.9% |
| Convolution module without self-attention | 83.8% | 98.4% |
| Convolution module without residual features | 83.5% | 98.1% |
| Convolution module | 85.6% | 99.2% |
Fig. 4 is a schematic flow diagram of a motion recognition method based on a deep residual network according to an embodiment. The method is applied to a deep residual network system comprising at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, and each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter. As shown in Fig. 4, the method includes the following.
402, a first convolution module of the at least one convolution module receives a video segment as input.
404, traversing the at least one convolution module by performing the following operations.
4042, the input is processed using the three-dimensional filter to reduce its dimensionality.
4044, the dimensionality-reduced input is processed using the one-dimensional filter to obtain a motion-related feature and using the two-dimensional filter to obtain an appearance-related feature.
4046, the motion-related feature is shifted by one step along the time dimension, and the shifted motion-related feature is subtracted from the motion-related feature to obtain a residual feature.
4048, a cascade feature is obtained by concatenating the motion-related feature, the residual feature, and the appearance-related feature along the channel dimension.
40410, performing global pooling on the cascade feature to obtain the pooled cascade feature.
40412, multiplying the pooled cascade feature by the weight matrix of the fully-connected layer to obtain a weighted cascade feature.
40414, adding the weighted cascade feature to the bias to obtain a biased cascade feature.
40416, obtaining the channel attention mask by processing the biased cascade feature using a Sigmoid function.
40418, obtaining an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the cascade feature.
40420, adding the intermediate feature to the cascade feature to obtain the attention feature.
40422, processing the attention feature using the three-dimensional filter to increase a dimension of the attention feature.
40424, taking the attention feature with increased dimensionality as the output of the convolution module.
40426, taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed.
406, identifying at least one action included in the video segment based on an output of a last convolution module of the at least one convolution module.
Fig. 5 shows an exemplary detailed design of the proposed convolution module. The proposed convolution module can be integrated into any standard CNN architecture, such as ResNet. To process RGB frames and residual frames simultaneously, the original data layer (i.e., the first convolution layer) is modified into two streams with parallel building blocks, one for each modality, which output the appearance-related features and the motion-related features, respectively. The resulting features from the two streams are concatenated and passed to the next layer (i.e., the subsequent convolution module). In the exemplary design of the proposed convolution module, a 2D filter of size 3 × 3 and a 1D filter of size 1 × 1 are used as examples. In particular, the size of a filter may be expressed as T × H × W, where T denotes the temporal dimension and H and W denote the height and width in the spatial dimensions.
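A minimal sketch of such a two-stream data layer, assuming the (N, C, T, H, W) layout; the class name, kernel size, stride, and channel count are illustrative assumptions rather than the values of the exemplary design.

```python
import torch
import torch.nn as nn

class TwoStreamDataLayer(nn.Module):
    """Two parallel building blocks, one per modality (RGB clip and
    residual clip); their outputs are concatenated on the channel
    dimension and passed to the next convolution module."""

    def __init__(self, out_ch: int = 64):
        super().__init__()
        self.rgb_stream = nn.Conv3d(3, out_ch, kernel_size=(1, 7, 7),
                                    stride=(1, 2, 2), padding=(0, 3, 3))
        self.res_stream = nn.Conv3d(3, out_ch, kernel_size=(1, 7, 7),
                                    stride=(1, 2, 2), padding=(0, 3, 3))

    def forward(self, rgb: torch.Tensor, res: torch.Tensor) -> torch.Tensor:
        f_app = self.rgb_stream(rgb)    # appearance-related features
        f_mot = self.res_stream(res)    # motion-related features
        return torch.cat([f_app, f_mot], dim=1)
```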
Table 4 shows an exemplary detailed design of a deep residual network system. As shown in Table 4, the deep residual network system includes four convolution modules, denoted res 2, res 3, res 4, and res 5, respectively. In this exemplary design, the size of a filter may be denoted as {T × S², C} to indicate its temporal, spatial, and channel sizes. Specifically, in Table 4, a 2D filter of size 3 × 3 and a 1D filter of size 1 × 1 are used as examples.
Table 4: exemplary detailed design of a deep residual network system.
In the embodiments of the present application, a new deep residual network system is provided. The deep residual network system comprises at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, and the motion recognition method based on the deep residual network is applied to this system. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module. A new convolution module is therefore proposed in the present application, which can be regarded as a pseudo-three-dimensional convolution module in which the standard 3D filter of the related art is decoupled into a parallel two-dimensional spatial filter and one-dimensional temporal filter. By using separable two-dimensional and one-dimensional convolutions instead of three-dimensional convolutions, the model size and the computational cost are greatly reduced. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that the appearance-related feature and the motion-related feature can be modeled differently.
For details of the above operations, reference may be made to the corresponding description in the motion recognition method based on the deep residual network, which is not repeated here.
The above description has presented the solutions of the embodiments mainly from the perspective of the method-side implementation. It can be understood that, to implement the above functions, the electronic device includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art will readily appreciate that the exemplary units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The electronic device according to the embodiments may be divided into functional units according to the above method; for example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of units in the embodiments is schematic and is merely a logical functional division; other division manners are possible in actual implementation.
Fig. 6 is a schematic configuration diagram of a motion recognition apparatus based on a deep residual network according to an embodiment. The apparatus is applied to a deep residual network system comprising at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter. As shown in Fig. 6, the motion recognition apparatus based on the deep residual network includes a receiving unit 602, a processing unit 604, and an identification unit 606.
The receiving unit 602 is configured to receive the video segment as an input at a first convolution module of the at least one convolution module.
The processing unit 604 is configured to traverse the at least one convolution module by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed.
The identification unit 606 is configured to identify at least one action included in the video segment based on the output of the last convolution module of the at least one convolution module.
In the embodiments of the present application, a new deep residual network system is provided. The deep residual network system comprises at least one convolution module, each of the at least one convolution module comprising at least one first convolution layer, each of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, and the motion recognition method based on the deep residual network is applied to this system. A first convolution module of the at least one convolution module receives a video segment as input. The at least one convolution module is traversed by performing the following operations: processing the input using the one-dimensional filter to obtain a motion-related feature and processing the input using the two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature; and taking the output of the convolution module as the input of the next convolution module, until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module. A new convolution module is therefore proposed in the present application, which can be regarded as a pseudo-three-dimensional convolution module in which the standard 3D filter of the related art is decoupled into a parallel two-dimensional spatial filter and one-dimensional temporal filter. By using separable two-dimensional and one-dimensional convolutions instead of three-dimensional convolutions, the model size and the computational cost are greatly reduced. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that the appearance-related feature and the motion-related feature can be modeled differently.
In an embodiment, in terms of obtaining the output of the convolution module, the processing unit 604 is specifically configured to: obtain a cascade feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature along the channel dimension; and determine the cascade feature as the output of the convolution module.
In an embodiment, in terms of obtaining the output of the convolution module, the processing unit 604 is specifically configured to: obtain a cascade feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature along the channel dimension; obtain a channel attention mask based on the cascade feature; and obtain an attention feature, as the output of the convolution module, based on the channel attention mask and the cascade feature.
In an embodiment, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the processing unit 604 is specifically configured to: obtain the attention feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature.
In an embodiment, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the processing unit 604 is specifically configured to: obtain an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature; and add the intermediate feature and the concatenated feature to obtain the attention feature.
In an embodiment, each convolution module of the at least one convolution module further includes a fully connected layer, and in obtaining the channel attention mask based on the concatenated feature, the processing unit 604 is configured to: perform global pooling on the concatenated feature to obtain a pooled concatenated feature; multiply the pooled concatenated feature by a weight matrix of the fully connected layer to obtain a weighted concatenated feature; add a bias to the weighted concatenated feature to obtain a biased concatenated feature; and process the biased concatenated feature using a Sigmoid function to obtain the channel attention mask.
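The channel attention just described, together with the two combination variants of the preceding embodiments, can be sketched as follows. The squeeze-and-excitation-style single fully connected layer, the (batch, channel, time, height, width) layout and the boolean flag that switches between the two output variants are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention mask from global pooling, a fully connected layer,
    a bias and a Sigmoid (hypothetical sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # nn.Linear holds both the weight matrix and the bias of the
        # fully connected layer.
        self.fc = nn.Linear(channels, channels)

    def forward(self, concat_feat: torch.Tensor, residual_connection: bool = True):
        b, c = concat_feat.shape[:2]
        # Global pooling over time and space -> (batch, channels).
        pooled = concat_feat.mean(dim=(2, 3, 4))
        # Weight matrix multiplication, bias addition, then Sigmoid -> mask.
        mask = torch.sigmoid(self.fc(pooled)).view(b, c, 1, 1, 1)
        # Channel-by-channel multiplication between the mask and the
        # concatenated feature (first variant).
        attended = mask * concat_feat
        # Second variant: add the concatenated feature back to the product.
        return attended + concat_feat if residual_connection else attended
```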
In one embodiment, each convolution module of the at least one convolution module further includes a second convolution layer located before the at least one first convolution layer, the second convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and in traversing the at least one convolution module, the processing unit 604 is further configured to: process the input using the three-dimensional filter to reduce the dimensionality of the input before processing the input using the one-dimensional filter to obtain the motion-related feature and processing the input using the two-dimensional filter to obtain the appearance-related feature; wherein, in terms of processing the input using a one-dimensional filter to obtain a motion-related feature and processing the input using a two-dimensional filter to obtain an appearance-related feature, the processing unit 604 is specifically configured to: process the dimension-reduced input using the one-dimensional filter to obtain the motion-related feature, and process the dimension-reduced input using the two-dimensional filter to obtain the appearance-related feature.
In one embodiment, each convolution module of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and in obtaining the output of the convolution module, the processing unit 604 is further configured to: process the concatenated feature using the three-dimensional filter to increase the dimensionality of the concatenated feature; and take the dimension-increased concatenated feature as the output of the convolution module.
In one embodiment, each convolution module of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and in obtaining the output of the convolution module, the processing unit 604 is further configured to: process the attention feature using the three-dimensional filter to increase the dimensionality of the attention feature; and take the dimension-increased attention feature as the output of the convolution module.
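Putting the pieces together, a bottleneck-style assembly such as the one below is one way the 1 × 1 × 1 dimension-reducing and dimension-raising convolutions could wrap the parallel paths and the channel attention. It reuses the two classes sketched earlier; the channel counts, the number of stacked modules, the clip size and the classification head (including the 174-class output, as in Something-Something V2) are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

# Reuses DecoupledSpatioTemporalPaths and ChannelAttention from the sketches above.


class PseudoThreeDModule(nn.Module):
    def __init__(self, in_channels: int, mid_channels: int):
        super().__init__()
        # Second convolution layer: 1 x 1 x 1 filter that reduces the dimension.
        self.reduce = nn.Conv3d(in_channels, mid_channels, kernel_size=1)
        # First convolution layer: parallel 2D spatial and 1D temporal paths.
        self.paths = DecoupledSpatioTemporalPaths(mid_channels)
        # Channel attention over the concatenated feature (3 * mid_channels channels).
        self.attention = ChannelAttention(3 * mid_channels)
        # Third convolution layer: 1 x 1 x 1 filter that raises the dimension again.
        self.expand = nn.Conv3d(3 * mid_channels, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        reduced = self.reduce(x)
        appearance, motion, residual = self.paths(reduced)
        # Concatenate the motion-related, residual and appearance-related
        # features in the channel dimension.
        concat_feat = torch.cat([motion, residual, appearance], dim=1)
        attended = self.attention(concat_feat)
        return self.expand(attended)


# Traversal: the output of each convolution module is the input of the next;
# the final output feeds a classifier that identifies the actions in the clip.
modules = nn.Sequential(*[PseudoThreeDModule(64, 16) for _ in range(3)])
clip = torch.randn(1, 64, 8, 56, 56)          # (batch, channels, frames, height, width)
features = modules(clip)                      # same shape as the input clip
num_classes = 174                             # e.g. Something-Something V2 (assumption)
head = nn.Linear(64, num_classes)
logits = head(features.mean(dim=(2, 3, 4)))   # one score per action class
```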
Fig. 7 is a schematic configuration diagram of a terminal device according to an embodiment. As shown in Fig. 7, the terminal device 700 includes a processor 701, a memory 702, a communication interface 703, and one or more programs 704 stored in the memory 702 and executed by the processor 701. The one or more programs 704 include instructions for implementing a deep residual network (ResNet) system. The deep residual network system includes at least one convolution module, each convolution module of the at least one convolution module includes at least one first convolution layer, and each first convolution layer of the at least one first convolution layer has at least one one-dimensional (1D) filter and at least one two-dimensional (2D) filter. The one or more programs 704 include instructions for performing the following operations.
A first convolution module of the at least one convolution module receives the video segment as an input. The at least one convolution module is traversed by performing the following operations: processing the input using a one-dimensional filter to obtain a motion-related feature, and processing the input using a two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature and the residual feature; and taking the output of the convolution module as the input of the next convolution module until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on an output of the last convolution module of the at least one convolution module.
In an embodiment of the present application, a new deep residual network system is provided. The deep residual network system includes at least one convolution module, each convolution module of the at least one convolution module includes at least one first convolution layer, each first convolution layer of the at least one first convolution layer has at least one one-dimensional filter and at least one two-dimensional filter, and the deep-residual-network-based action recognition method is applied to the deep residual network system. A first convolution module of the at least one convolution module receives the video segment as an input. The at least one convolution module is traversed by performing the following operations: processing the input using a one-dimensional filter to obtain a motion-related feature, and processing the input using a two-dimensional filter to obtain an appearance-related feature; shifting the motion-related feature by one step along the time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature; obtaining the output of the convolution module based on the appearance-related feature, the motion-related feature and the residual feature; and taking the output of the convolution module as the input of the next convolution module until the last convolution module of the at least one convolution module is traversed. At least one action included in the video segment is identified based on the output of the last convolution module of the at least one convolution module. Therefore, a new convolution module is proposed in the present application, which can be regarded as a pseudo three-dimensional convolution module, in which the standard 3D filter in the related art is decoupled into a parallel two-dimensional spatial filter and a one-dimensional temporal filter. By using separable two-dimensional and one-dimensional convolutions instead of three-dimensional convolutions, the model size and the computational cost are greatly reduced. Furthermore, the 2D convolution and the 1D convolution are placed in different paths, so that the appearance-related feature and the motion-related feature can be modeled differently.
In one embodiment, in obtaining the output of the convolution module, the one or more programs 704 include instructions for performing the following operations: obtaining a concatenated feature by concatenating the motion-related feature, the residual feature and the appearance-related feature in the channel dimension; and determining the concatenated feature as the output of the convolution module.
In one embodiment, in obtaining the output of the convolution module, the one or more programs 704 include instructions for performing the following operations: obtaining a concatenated feature by concatenating the motion-related feature, the residual feature and the appearance-related feature in the channel dimension; obtaining a channel attention mask based on the concatenated feature; and obtaining an attention feature as the output of the convolution module based on the channel attention mask and the concatenated feature.
In one embodiment, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the one or more programs 704 include instructions for performing the following operation: obtaining the attention feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature.
In one embodiment, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the one or more programs 704 include instructions for performing the following operations: obtaining an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature; and adding the intermediate feature and the concatenated feature to obtain the attention feature.
In one embodiment, each convolution module of the at least one convolution module further includes a fully connected layer, and in obtaining the channel attention mask based on the concatenated feature, the one or more programs 704 include instructions for performing the following operations: performing global pooling on the concatenated feature to obtain a pooled concatenated feature; multiplying the pooled concatenated feature by a weight matrix of the fully connected layer to obtain a weighted concatenated feature; adding a bias to the weighted concatenated feature to obtain a biased concatenated feature; and processing the biased concatenated feature using a Sigmoid function to obtain the channel attention mask.
In one embodiment, each convolution module of the at least one convolution module further includes a second convolution layer located before the at least one first convolution layer, the second convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and the one or more programs 704 further include instructions for performing the following operations in traversing the at least one convolution module: before processing the input using the one-dimensional filter to obtain the motion-related feature and processing the input using the two-dimensional filter to obtain the appearance-related feature, processing the input using the three-dimensional filter to reduce the dimensionality of the input; wherein, in processing the input using a one-dimensional filter to obtain a motion-related feature and processing the input using a two-dimensional filter to obtain an appearance-related feature, the one or more programs 704 include instructions for performing the following operations: processing the dimension-reduced input using the one-dimensional filter to obtain the motion-related feature, and processing the dimension-reduced input using the two-dimensional filter to obtain the appearance-related feature.
In one embodiment, each convolution module of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and the one or more programs 704 further include instructions for performing the following operations in obtaining the output of the convolution module: processing the concatenated feature using the three-dimensional filter to increase the dimensionality of the concatenated feature; and taking the dimension-increased concatenated feature as the output of the convolution module.
In one embodiment, each convolution module of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and the one or more programs 704 include instructions for performing the following operations in obtaining the output of the convolution module: processing the attention feature using the three-dimensional filter to increase the dimensionality of the attention feature; and taking the dimension-increased attention feature as the output of the convolution module.
A non-transitory computer storage medium is also provided. The non-transitory computer storage medium is configured to store a program that, when executed, is operable to perform some or all of the operations of the deep residual network based action recognition method described in the above-described method embodiments.
A computer program product is also provided. The computer program product includes a non-transitory computer readable storage medium storing a computer program. The computer program may cause a computer to perform some or all of the operations of the deep residual network based action recognition method described in the above method embodiments.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or a combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of the acts described, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily required by the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only one type of logical function division, and other division manners may be adopted in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above methods according to the embodiments of the present invention. The aforementioned memory includes various media capable of storing program codes, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Embodiments of the present application have been described above in detail, and specific examples have been used herein to explain the principles and implementations of the present application. The above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application according to the idea of the present application. In view of the above, nothing in this specification should be construed as a limitation on the present application.
Claims (20)
1. A method for action recognition based on a deep residual network, applied to a deep residual network system including at least one convolution module, each convolution module of the at least one convolution module including at least one first convolution layer, each first convolution layer of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, the method comprising: receiving a video segment as an input at a first convolution module of the at least one convolution module;
traversing the at least one convolution module by performing the following operations:
processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features;
shifting the motion-related feature by one step along a time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature;
obtaining an output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature;
taking the output of the convolution module as the input of the next convolution module until the last convolution module of the at least one convolution module is traversed; and
identifying at least one action included in the video segment based on an output of a last convolution module of the at least one convolution module.
2. The method of claim 1, wherein obtaining the output of the convolution module comprises:
obtaining a concatenated feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension; and
determining the concatenated feature as the output of the convolution module.
3. The method of claim 1, wherein obtaining the output of the convolution module comprises:
obtaining a concatenated feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension;
obtaining a channel attention mask based on the concatenated feature; and
obtaining an attention feature as the output of the convolution module based on the channel attention mask and the concatenated feature.
4. The method of claim 3, wherein obtaining the attention feature based on the channel attention mask and the concatenated feature comprises:
obtaining the attention feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature.
5. The method of claim 3, wherein obtaining the attention feature based on the channel attention mask and the concatenated feature comprises:
obtaining an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature; and
adding the intermediate feature and the concatenated feature to obtain the attention feature.
6. The method of any of claims 3 to 5, wherein each of the at least one convolution module further comprises a fully connected layer, and wherein obtaining the channel attention mask based on the concatenated feature comprises:
performing global pooling on the concatenated feature to obtain a pooled concatenated feature;
multiplying the pooled concatenated feature by a weight matrix of the fully connected layer to obtain a weighted concatenated feature;
adding a bias to the weighted concatenated feature to obtain a biased concatenated feature; and
processing the biased concatenated feature using a Sigmoid function to obtain the channel attention mask.
7. The method of any of claims 1 to 6, wherein each of the at least one convolution module further comprises a second convolution layer located before the at least one first convolution layer, the second convolution layer comprising a three-dimensional filter having a size of 1 × 1 × 1, and wherein traversing the at least one convolution module further comprises:
prior to processing the input using the one-dimensional filter to obtain motion-related features and processing the input using the two-dimensional filter to obtain appearance-related features:
processing the input using the three-dimensional filter to reduce a dimension of the input;
wherein processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features comprises:
processing the reduced dimension input using the one-dimensional filter to obtain the motion-related feature, and processing the reduced dimension input using the two-dimensional filter to obtain the appearance-related feature.
8. The method of claim 7, wherein each of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and wherein obtaining the output of the convolution module further comprises:
processing the concatenated feature using the three-dimensional filter to increase a dimension of the concatenated feature; and
taking the dimension-increased concatenated feature as the output of the convolution module.
9. The method of claim 7, wherein each of the at least one convolution module further includes a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer including a three-dimensional filter having a size of 1 × 1 × 1, and wherein obtaining the output of the convolution module further comprises:
processing the attention feature with the three-dimensional filter to increase a dimension of the attention feature; and
taking the dimension-increased attention feature as the output of the convolution module.
10. An action recognition device based on a deep residual network, applied to a deep residual network system comprising at least one convolution module, each convolution module of the at least one convolution module comprising at least one first convolution layer, each first convolution layer of the at least one first convolution layer having at least one one-dimensional filter and at least one two-dimensional filter, the device comprising:
a receiving unit to receive a video segment as an input at a first convolution module of the at least one convolution module;
a processing unit to traverse the at least one convolution module by performing the following operations:
processing the input using a one-dimensional filter to obtain motion-related features and processing the input using a two-dimensional filter to obtain appearance-related features;
shifting the motion-related feature by one step along a time dimension, and subtracting the shifted motion-related feature from the motion-related feature to obtain a residual feature;
obtaining an output of the convolution module based on the appearance-related feature, the motion-related feature, and the residual feature;
taking the output of the convolution module as the input of the next convolution module until the last convolution module of the at least one convolution module is traversed; and
an identification unit to identify at least one action included in the video segment based on an output of a last convolution module of the at least one convolution module.
11. The apparatus according to claim 10, wherein, in obtaining the output of the convolution module, the processing unit is specifically configured to:
obtaining a concatenated feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension; and
determining the concatenated feature as the output of the convolution module.
12. The apparatus according to claim 10, wherein, in obtaining the output of the convolution module, the processing unit is specifically configured to:
obtaining a concatenated feature by concatenating the motion-related feature, the residual feature, and the appearance-related feature in a channel dimension;
obtaining a channel attention mask based on the concatenated feature; and
obtaining an attention feature as the output of the convolution module based on the channel attention mask and the concatenated feature.
13. The apparatus of claim 12, wherein, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the processing unit is specifically configured to:
obtaining the attention feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature.
14. The apparatus of claim 12, wherein, in obtaining the attention feature based on the channel attention mask and the concatenated feature, the processing unit is specifically configured to:
obtaining an intermediate feature by performing a channel-by-channel multiplication between the channel attention mask and the concatenated feature; and
adding the intermediate feature and the concatenated feature to obtain the attention feature.
15. The apparatus according to any one of claims 12 to 14, wherein each of the at least one convolution module further comprises a fully connected layer, the processing unit being specifically configured to, in obtaining the channel attention mask based on the concatenated feature:
performing global pooling on the concatenated feature to obtain a pooled concatenated feature;
multiplying the pooled concatenated feature by a weight matrix of the fully connected layer to obtain a weighted concatenated feature;
adding a bias to the weighted concatenated feature to obtain a biased concatenated feature; and
processing the biased concatenated feature using a Sigmoid function to obtain the channel attention mask.
16. The apparatus of any of claims 10 to 15, wherein each of the at least one convolution module further comprises a second convolution layer located before the at least one first convolution layer, the second convolution layer comprising a three-dimensional filter having a size of 1 × 1 × 1, the processing unit being further configured to, in traversing the at least one convolution module: processing the input using the three-dimensional filter to reduce the dimensionality of the input prior to processing the input using the one-dimensional filter to obtain the motion-related feature and processing the input using the two-dimensional filter to obtain the appearance-related feature;
wherein, in processing the input using a one-dimensional filter to obtain the motion-related feature and processing the input using a two-dimensional filter to obtain the appearance-related feature, the processing unit is specifically configured to: processing the dimension-reduced input using the one-dimensional filter to obtain the motion-related feature, and processing the dimension-reduced input using the two-dimensional filter to obtain the appearance-related feature.
17. The apparatus of claim 16, wherein each of the at least one convolution module further comprises a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer comprising a three-dimensional filter having a size of 1 × 1 × 1, the processing unit being further configured to, in obtaining the output of the convolution module:
processing the concatenated feature using the three-dimensional filter to increase a dimension of the concatenated feature; and
taking the dimension-increased concatenated feature as the output of the convolution module.
18. The apparatus of claim 16, wherein each of the at least one convolution module further comprises a third convolution layer located after the at least one first convolution layer and the second convolution layer, the third convolution layer comprising a three-dimensional filter having a size of 1 × 1 × 1, the processing unit being further configured to, in obtaining the output of the convolution module:
processing the attention feature with the three-dimensional filter to increase a dimension of the attention feature; and
taking the dimension-increased attention feature as the output of the convolution module.
19. A terminal device, comprising a processor and a memory configured to store one or more programs, wherein the one or more programs are configured to be executed by the processor and comprise instructions for performing the method of any one of claims 1 to 9.
20. A non-transitory computer readable storage medium storing a computer program for electronic data exchange, which when executed, causes a computer to perform the method according to any one of claims 1 to 9.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202063050604P | 2020-07-10 | 2020-07-10 | |
US63/050,604 | 2020-07-10 | ||
PCT/CN2021/105520 WO2022007954A1 (en) | 2020-07-10 | 2021-07-09 | Method for action recognition based on deep residual network, and related products |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115836330A true CN115836330A (en) | 2023-03-21 |
Family
ID=79552270
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202180048575.9A Pending CN115836330A (en) | 2020-07-10 | 2021-07-09 | Action identification method based on depth residual error network and related product |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115836330A (en) |
WO (1) | WO2022007954A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117958813A (en) * | 2024-03-28 | 2024-05-03 | 北京科技大学 | ECG (ECG) identity recognition method, system and equipment based on attention depth residual error network |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114758265A (en) * | 2022-03-08 | 2022-07-15 | 深圳集智数字科技有限公司 | Escalator operation state identification method and device, electronic equipment and storage medium |
CN118470014B (en) * | 2024-07-11 | 2024-09-24 | 南京信息工程大学 | Industrial anomaly detection method and system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10706350B1 (en) * | 2017-08-11 | 2020-07-07 | Facebook, Inc. | Video analysis using convolutional networks |
CA3016953A1 (en) * | 2017-09-07 | 2019-03-07 | Comcast Cable Communications, Llc | Relevant motion detection in video |
CN109670529A (en) * | 2018-11-14 | 2019-04-23 | 天津大学 | A kind of separable decomposition residual error modularity for quick semantic segmentation |
CN109635790A (en) * | 2019-01-28 | 2019-04-16 | 杭州电子科技大学 | A kind of pedestrian's abnormal behaviour recognition methods based on 3D convolution |
- 2021-07-09 WO PCT/CN2021/105520 patent/WO2022007954A1/en active Application Filing
- 2021-07-09 CN CN202180048575.9A patent/CN115836330A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
WO2022007954A1 (en) | 2022-01-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||