CN116665308B - Double interaction space-time feature extraction method - Google Patents
- Publication number
- CN116665308B CN116665308B CN202310741806.5A CN202310741806A CN116665308B CN 116665308 B CN116665308 B CN 116665308B CN 202310741806 A CN202310741806 A CN 202310741806A CN 116665308 B CN116665308 B CN 116665308B
- Authority
- CN
- China
- Prior art keywords
- time
- space
- feature
- double interaction
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a double interaction space-time feature extraction method, and relates to the technical field of machine vision. The method comprises the following steps: preprocessing skeleton data of a dataset and extracting double interaction action categories to obtain action tensors; extracting double interaction space-time features through a space-time graph convolution network, capturing global and local information; performing feature fusion on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor; and passing the final feature tensor through a fully connected layer and a Softmax layer to help the network converge and output the double interaction category. The method has the advantage of high recognition accuracy.
Description
Technical Field
The invention relates to the technical field of machine vision, and in particular to a double interaction space-time feature extraction method based on a Transformer and multi-scale position awareness.
Background
In the human-centered field of Computer Vision (CV) research, the Human Action Recognition (HAR) task has become an important research topic owing to its wide application in many fields such as human-computer interaction, smart homes, autonomous driving, and virtual reality. At present, video-based single-person behavior recognition has been studied relatively extensively, while double (two-person) interactive behavior recognition is still at an exploratory stage. Compared with single-person actions, double interaction behavior recognition must not only cope with illumination changes, scene switching, and camera-viewpoint conversion, but also consider changes in the relative relationship between the two persons, limb occlusion, and changes in the space-time relationship during the interaction. Double interactive behavior recognition therefore remains a challenging problem in the field of computer vision, and how to effectively extract features and build a reasonable action recognition model has long been the research focus of related researchers at home and abroad.
Traditional action recognition mainly comprises a feature extraction part and a classifier, with features designed manually to extract targeted picture features. With the development of action recognition, action data has evolved from two-dimensional plane diagrams to three-dimensional skeleton data, the classification of single actions has expanded to the interactive recognition of double and even group actions, and recognition scenes have grown increasingly complex. With the development of deep learning, neural network models, particularly deep networks, have achieved wide success in complex action recognition. To create a skeleton graph representing a two-person relationship, Liu Xing et al. propose representing a single skeleton and an interactive-relationship skeleton separately in a coordinate system using a method of relative views. Pei Xiaomin et al. propose using the camera as the coordinate center and computing the Euclidean distances of the single and double skeletons themselves and of the interaction joints to represent double-skeleton features. Li Jianan et al. propose constructing knowledge-given graphs, knowledge-learning graphs, and natural connection graphs to learn interactions with minimal prior knowledge. Zhu L et al. propose constructing a binary relationship interaction graph to generate a relationship adjacency matrix for modeling double interaction. Yoshiki Ito et al. propose feeding intra-body and inter-body graphs into a multi-stream network to extract interactions. However, none of these methods considers the influence of long-range joint feature information and long-range dependence on recognition accuracy, and fine local joint information is ignored.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a double interaction space-time feature extraction method with high recognition accuracy based on a Transformer and multi-scale position awareness.
In order to solve the above technical problems, the invention adopts the following technical scheme: a double interaction space-time feature extraction method comprising the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
S3: performing feature fusion processing on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor;
S4: passing the finally obtained feature tensor through the fully connected layer and the Softmax layer to help the network converge and output the double interaction category.
The further technical proposal is that: a double interaction spatial feature extraction module combining a Transformer and a lightweight spatial graph convolution is constructed to extract the double interaction spatial features; and a multi-scale position-aware temporal graph convolution module, which has a larger temporal receptive field and focuses on important joint position information, is constructed to extract the double interaction temporal features.
The further technical proposal is that: the STCP module based on three-branch pooling performs feature fusion processing on the feature tensor; the module comprises spatial, temporal, and channel branches for processing the feature tensor, and a more accurate feature map is obtained by running the three branches in parallel and fusing their features by concatenation.
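As an illustrative aid (not part of the claimed method), the overall pipeline of steps S1-S4 can be walked through as a minimal NumPy shape sketch; the NTU-RGB+D-style input layout (3 coordinates, 64 frames, 25 joints, 2 persons), the 256-dimensional backbone feature, the class count of 26, and the random weights are all hypothetical placeholders, not values fixed by the invention:

```python
import numpy as np

# Hypothetical shape walk-through of steps S1-S4. The input layout
# (C=3 coordinates, T=64 frames, V=25 joints, M=2 persons) mimics
# NTU-RGB+D-style skeleton data and is an assumption, not the patent's spec.
rng = np.random.default_rng(0)

# S1: preprocessed action tensor for one two-person sample
x = rng.standard_normal((3, 64, 25, 2))               # (C, T, V, M)

# S2 + S3 stand-in: suppose the ST-GCN backbone plus STCP emit a 256-d vector
feat = rng.standard_normal(256)

# S4: fully connected layer + Softmax over a hypothetical 26 action classes
W, b = 0.01 * rng.standard_normal((26, 256)), np.zeros(26)
logits = W @ feat + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax output
print(x.shape, probs.shape)
```

The softmax output sums to one, so the largest entry can be read directly as the predicted double interaction category.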
The beneficial effects produced by adopting the above technical scheme are as follows: in the space-time feature extraction process, the method combines local information with global information and captures fine, important joint details, which improves the accuracy of the reference model in recognizing double interaction actions; moreover, the proposed modules are highly embeddable and can be conveniently embedded into other network models.
Drawings
The invention will be described in further detail with reference to the drawings and the detailed description.
FIG. 1 is a flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a Transformer-based spatial feature extraction module in a method according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a multi-scale location-aware temporal feature extraction module according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of an STCP attention module in the method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
As shown in fig. 1, the embodiment of the invention discloses a double interaction space-time feature extraction method, which comprises the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
the space-time feature extraction comprises extracting double interaction spatial features through a Transformer-based spatial graph convolution and extracting double interaction temporal features through a multi-scale position-aware temporal graph convolution, so that deep extraction of the space-time features is realized and both local and global information is captured.
As shown in fig. 2, the embodiment of the invention further discloses a Transformer-based spatial graph convolution module for extracting the spatial features of the bone joint points. Firstly, a 1×1 convolution is applied to the input skeleton graph to introduce more nonlinear factors, a preliminary double interaction spatial feature extraction is performed on the input vector using a lightweight spatial graph convolution, and the feature vector is then normalized by a Batch Normalization layer. The process is defined as:

f_out = Σ_{d=0..2} Λ_d^(-1/2) A_d Λ_d^(-1/2) f_in W_d (1)

F_out = BN(f_out) (2)

where Λ_d normalizes the adjacency matrix A_d, f_in and f_out represent the input and output features, d represents the graph distance metric function with a maximum of 2, W_d is the learnable weight of the d-th partition, and BN represents the batch normalization layer;
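As an illustrative sketch of equations (1)-(2), the normalized multi-hop graph convolution can be written in NumPy as follows; the 5-joint chain graph, the channel sizes, and the random weights W_d are toy assumptions, and batch normalization is approximated by per-channel standardization:

```python
import numpy as np

def norm_adj(A):
    # Lambda^(-1/2) A Lambda^(-1/2): symmetric degree normalization
    deg = A.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    return d[:, None] * A * d[None, :]

def spatial_graph_conv(f_in, adjs, weights):
    # eq. (1): f_out = sum over d of norm(A_d) f_in W_d, graph distance d = 0, 1, 2
    return sum(norm_adj(A) @ f_in @ W for A, W in zip(adjs, weights))

rng = np.random.default_rng(1)
V, C_in, C_out = 5, 3, 8                       # toy sizes: 5 joints, 3 -> 8 channels
A0 = np.eye(V)                                 # d = 0: self-connections
A1 = np.zeros((V, V))
A1[np.arange(V - 1), np.arange(1, V)] = 1.0
A1 += A1.T                                     # d = 1: a simple chain skeleton
A2 = ((np.linalg.matrix_power(A1, 2) > 0) & (A0 == 0) & (A1 == 0)).astype(float)  # d = 2

f_in = rng.standard_normal((V, C_in))
Ws = [rng.standard_normal((C_in, C_out)) for _ in range(3)]
f_out = spatial_graph_conv(f_in, [A0, A1, A2], Ws)

# eq. (2) stand-in: batch normalization approximated by per-channel standardization
F_out = (f_out - f_out.mean(axis=0)) / (f_out.std(axis=0) + 1e-5)
print(f_out.shape)
```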
the encoder is a transducer encoder which enters a transducer immediately, the number of layers of the encoder is defined as 2, the module consists of a multi-head attention mechanism, a feedforward neural network, a normalization layer and residual error connection, the most core of the encoder part is the multi-head attention module, a single sub-attention mechanism can be split into a plurality of subspaces, the sub-attention mechanism is executed on each subspace, so that the characteristics of different layers and different angles can be captured better, global modeling can be carried out on the spatial characteristics, and different weights can be distributed to the characteristic diagram in a self-adaptive mode; the feed-forward neural network sublayer consists of two linear transforms and an activation function, wherein the first linear transform converts the input vector into an intermediate representation vector and the second linear transform converts the intermediate representation vector into a final representation vector; the residual connection and normalization layer is used for accelerating model convergence and improving model expression capacity, the residual connection can enable the model to be trained more easily, gradient disappearance and gradient explosion are avoided, the normalization layer can accelerate model convergence, and meanwhile robustness and generalization capacity of the model are improved. In addition, an additional residual error connection is adopted for the whole module, so that the over fitting of the model is prevented, the network parameter number is reduced, and the time complexity is reduced. The process is defined as:
(3)
(4)
(5)
(6)
wherein:for inputting information +.>For content information->For the information itself +.>The attention moment array is converted into standard state distribution, < >>Normalization is achieved. />The input to each layer of neurons is translated into a mean variance,representing the input vector +.>Representing input vector +.>Representing the final characteristics of the output through the encoder,
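Equations (3)-(6) can be sketched as a single-head NumPy encoder layer; the multi-head split and the learned Q/K/V projections of the actual encoder are omitted for brevity, and the widths and weights are toy assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V):
    # eq. (3): softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + attention(X, X, X))           # eq. (4), single-head stand-in
    ffn = np.maximum(0.0, X @ W1 + b1) @ W2 + b2     # eq. (5): ReLU between two linear maps
    return layer_norm(X + ffn)                       # residual + normalization

rng = np.random.default_rng(2)
N, d, d_ff = 25, 16, 32                              # 25 joint tokens, toy widths
X = rng.standard_normal((N, d))
W1, b1 = 0.1 * rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.standard_normal((d_ff, d)), np.zeros(d)

f_tran = encoder_layer(X, W1, b1, W2, b2)
Y = X + f_tran                                       # eq. (6): outer residual connection
print(Y.shape)
```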
as shown in fig. 3, the embodiment of the invention also discloses a convolution module based on the multi-scale position sensing map for extracting the time characteristics of the bone joint points. Firstly, 4 parallel time convolution branches are carried out on the extracted space feature diagram, and each branch starts with convolution of 1 multiplied by 1; then pass through batch layer and ReLU activation function pairNormalizing the feature map; the first two branches are then convolved with 2 3 x 1 times and 2 different syndromes are applied to fuse features between different channels to obtain a multi-scale time-receptive field; while the third branch extracts the most significant feature information in successive frames through a 3 x 1 max pooling layer; the last branch contains a residual connection to maintain the gradient during back propagation; the four branches are subjected to multi-scale feature fusion through product operation; residual connection is added to the outer layer of the multi-scale convolution to help the network to converge rapidly, and the multi-scale convolution is combined with the multi-scale features through weighted summation operation; taking as input a multi-scale temporal feature, using a pooled convolution kernel of two spatial dimensionsOr->Each channel is encoded along a horizontal coordinate and a vertical coordinate, respectively, to generate a pair of feature maps with direction sensing capability. The process is defined as:
(7)
(8)
in the method, in the process of the invention,and->Represents->The height in the individual channels is +.>Width is +.>Output of->Indicate->Characteristic tensor of the channel.
The generated fusion feature maps are concatenated and transmitted to a shared 1×1 convolution transform function F_1. The process is defined as:

f = δ(F_1([z^h, z^w])) (9)

where [·,·] represents concatenation along the spatial dimension, δ represents a nonlinear activation function, and f represents the intermediate feature map encoding spatial information in the horizontal and vertical directions.
f is then divided along the spatial dimension into two independent tensors f^h and f^w, which are transformed through two 1×1 convolutions F_h and F_w into tensors with the same number of channels as the input tensor X. The process is defined as:

g^h = σ(F_h(f^h)) (10)

g^w = σ(F_w(f^w)) (11)

where σ represents the sigmoid activation function.
The outputs g^h and g^w are used as attention weights, and the coordinate attention block finally outputs Y. The process is defined as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (12)
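The coordinate-attention computation of equations (7)-(12) can be sketched in NumPy as follows; for brevity the 1×1 convolutions F_1, F_h, and F_w are replaced by identity maps and δ by ReLU, so only the pooling, split, and gating structure is illustrated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x):
    # x: (C, H, W) with H = frames and W = joints
    C, H, W = x.shape
    z_h = x.mean(axis=2)                             # eq. (7): pool along width  -> (C, H)
    z_w = x.mean(axis=1)                             # eq. (8): pool along height -> (C, W)
    f = np.maximum(0.0, np.concatenate([z_h, z_w], axis=1))  # eq. (9), delta = ReLU, F1 = identity
    f_h, f_w = f[:, :H], f[:, H:]                    # split back along the spatial dimension
    g_h, g_w = sigmoid(f_h), sigmoid(f_w)            # eqs. (10)-(11), Fh = Fw = identity
    return x * g_h[:, :, None] * g_w[:, None, :]     # eq. (12): direction-aware reweighting

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 64, 25))                 # toy (channels, frames, joints) tensor
y = coordinate_attention(x)
print(y.shape)
```

Since both gates lie in (0, 1), each output entry is an attenuated copy of the corresponding input entry.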
s3: the feature fusion processing is carried out on the feature tensor by the STCP module based on three-branch pooling, so that the joint with the most abundant information in the specific frame is distinguished from the whole time frame sequence, and the fine-granularity double interaction space-time feature tensor is obtained. Firstly, carrying out average pooling operation on input features on a frame level and an articular level respectively, and carrying out local average pooling and local partial pooling on the feature vectors subjected to time dimension pooling to obtain the articulation point data with different importance corresponding to double interaction actions. The process is defined as:
(13)
(14)
(15)
in the middle ofIndicated are pooling operations of the corresponding dimensions.
The space-time dimension feature vectors are then pooled along the channel dimension as input, after which the three branch feature vectors are combined and concatenated together, and the information is compressed through a fully connected layer. The process is defined as:

f_c = pool_p(f_t) ⊙ pool_v(f_in) (16)

f̃ = θ(W[f_p, f_v, f_c]) (17)

where ⊙ represents the dot product, [·,·] represents the concatenation operation, θ represents the HardSwish activation function, and W represents a trainable parameter;
And then, three independent fully connected layers are used to obtain the attention scores of the time-frame dimension, the joint dimension, and the channel dimension, and finally the three are multiplied to obtain the space-time-channel local attention map as the attention score of the whole action sequence:

Att = σ(FC_t(f̃)) ⊗ σ(FC_v(f̃)) ⊗ φ(FC_c(f̃)) (18)

where f̃ is the fused feature vector, FC_t, FC_v, and FC_c are the three fully connected layers, σ represents the sigmoid activation function, and φ represents the Swish activation function.
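The three-branch STCP attention of equations (13)-(18) can be loosely sketched in NumPy; the fully connected layers are replaced by identity stand-ins and the local partial pooling by a plain mean, so only the pooling and broadcasting structure is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def stcp_attention(f_in):
    # f_in: (C, T, V). Only the pooling/broadcast structure of eqs. (13)-(18)
    # is shown; the fully connected layers are identity stand-ins.
    f_t = f_in.mean(axis=1)                          # eq. (13): pool frames -> (C, V)
    f_v = f_in.mean(axis=2)                          # eq. (14): pool joints -> (C, T)
    f_p = f_t.mean(axis=1)                           # eq. (15) stand-in: pooled f_t -> (C,)
    f_c = f_p * f_v.mean(axis=1)                     # eq. (16) stand-in: channel branch -> (C,)
    a_t = sigmoid(f_v)                               # frame-dimension attention scores
    a_v = sigmoid(f_t)                               # joint-dimension attention scores
    a_c = swish(f_c)                                 # channel-dimension attention scores
    att = a_c[:, None, None] * a_t[:, :, None] * a_v[:, None, :]   # eq. (18)
    return f_in * att

rng = np.random.default_rng(4)
f = rng.standard_normal((8, 64, 25))
out = stcp_attention(f)
print(out.shape)
```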
S4: and the finally obtained characteristic tensor helps the network to converge through the full connection layer and the Softmax layer so as to output the double interaction category.
In order to extract the double interaction spatial features more effectively, the method adds a Transformer to the backbone network: after the spatial graph convolution performs primary feature extraction, the spatial feature vectors are extracted again through the Transformer feature extractor, capturing important joint information that would otherwise be lost, so that the spatial graph convolution part of the backbone network fully retains the detail information; residual connections are added inside, which greatly shortens the model training time. The model therefore outperforms other networks in the spatial feature extraction part.
In order to enlarge the temporal receptive field and address the long-term dependence problem, the method introduces multi-scale convolution to obtain multi-scale information; meanwhile, to enhance the sensitivity of the network model to informative channels and improve the position-awareness capability, a position-aware attention module is added.
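The four-branch multi-scale temporal convolution described above can be sketched structurally in NumPy; the shared 3-tap kernel, the dilation rates 1 and 2, and the sum fusion are simplifying assumptions rather than the patented layer itself:

```python
import numpy as np

def temporal_conv(x, k, dilation=1):
    # depthwise temporal convolution along axis 1 of x (C, T, V), zero-padded to keep T
    C, T, V = x.shape
    r = (len(k) - 1) * dilation // 2
    xp = np.pad(x, ((0, 0), (r, r), (0, 0)))
    out = np.zeros_like(x)
    for i, w in enumerate(k):
        out += w * xp[:, i * dilation : i * dilation + T, :]
    return out

def multi_scale_temporal(x):
    # four parallel branches fused by summation -- a structural sketch only
    k = np.array([0.25, 0.5, 0.25])                  # toy shared 3-tap kernel
    b1 = temporal_conv(x, k, dilation=1)             # 3x1 conv, small receptive field
    b2 = temporal_conv(x, k, dilation=2)             # 3x1 conv, dilated receptive field
    C, T, V = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (0, 0)), constant_values=-np.inf)
    b3 = np.max(np.stack([xp[:, i:i + T, :] for i in range(3)]), axis=0)  # 3x1 max pooling
    b4 = x                                           # residual branch keeps the gradient path
    return b1 + b2 + b3 + b4

rng = np.random.default_rng(5)
x = rng.standard_normal((4, 16, 25))                 # toy (channels, frames, joints)
y = multi_scale_temporal(x)
print(y.shape)
```

Dilation widens the temporal receptive field without adding kernel taps, which is the point of using two different dilation rates in the first two branches.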
After constructing the double interaction space-time feature extraction method based on the Transformer and multi-scale position awareness, the method considers the importance of different body parts across the whole action sequence and the importance of time frames and channels to the weighted bone joints at different action stages, and accordingly designs the STCP module. The module is divided into a three-branch structure over the time, space, and channel dimensions, performing pooling operations in space and time respectively; in the time dimension, joint-point data of different importance corresponding to the double interaction actions is obtained through partial segmentation and partial pooling. A channel-dimension pooling operation is then applied to both; the obtained feature vectors are concatenated, and the attention scores on the spatial local joints, time, and channels are obtained through three fully connected layers; finally the three are multiplied to obtain the space-time-channel local attention map.
In summary, in the space-time feature extraction process the method combines local information with global information and captures fine, important joint details, improving the accuracy of the reference model in recognizing double interaction actions; the proposed modules are highly embeddable and can be conveniently embedded into other network models.
Claims (1)
1. A double interaction space-time feature extraction method, characterized by comprising the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
S3: performing feature fusion processing on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor;
S4: passing the finally obtained feature tensor through the fully connected layer and the Softmax layer to help the network converge and output the double interaction action category;
the extracting the double interaction space-time characteristics through the space-time diagram convolution network comprises the following steps:
double interaction spatial features are extracted through a Transformer-based spatial graph convolution, and double interaction temporal features are extracted through a multi-scale position-aware temporal graph convolution, so that deep extraction of the space-time features is achieved and both local and global information is captured;
the method for extracting the double interaction space features comprises the following steps:
firstly, carrying out 1×1 convolution on the input skeleton graph to introduce more nonlinear factors, carrying out preliminary double interaction spatial feature extraction on the input vector by using a lightweight spatial graph convolution, and then normalizing the feature vector through a Batch Normalization layer, wherein the process is defined as:

f_out = Σ_{d=0..2} Λ_d^(-1/2) A_d Λ_d^(-1/2) f_in W_d (1)

F_out = BN(f_out) (2)

wherein Λ_d normalizes the adjacency matrix A_d, f_in and f_out represent the input and output features, d represents the graph distance metric function with a maximum of 2, W_d is the learnable weight of the d-th partition, and BN represents the batch normalization layer;

the feature then enters a Transformer encoder, the number of encoder layers being defined as 2, wherein the encoder comprises a multi-head attention mechanism, a feed-forward neural network, a normalization layer, and residual connections;

an additional residual connection is adopted around the whole encoder to prevent the model from overfitting, reduce the number of network parameters, and lower the time complexity, the process being defined as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (3)

X_Add = LayerNorm(X + MultiHeadAttention(X)) (4)

FFN(Z) = max(0, ZW_1 + b_1)W_2 + b_2 (5)

Y = add(f_in, f_tran) (6)

wherein: Q is the input information, K is the content information, V is the information itself, √d_k scales the attention matrix toward a standard distribution, and softmax realizes the normalization; LayerNorm normalizes the input of each layer of neurons to a given mean and variance, Z represents the input vector of the feed-forward sublayer, f_in represents the input vector, and f_tran represents the double interaction space-time features output by the encoder;
the extraction method of the double interaction temporal features comprises the following steps:

firstly, the extracted spatial feature map is passed through 4 parallel temporal convolution branches, each branch beginning with a 1×1 convolution; the feature map is then normalized through a batch layer and a ReLU activation function; the first two branches then apply two 3×1 convolutions with 2 different dilation rates to fuse features between different channels and obtain a multi-scale temporal receptive field; the third branch extracts the most significant feature information in successive frames through a 3×1 max pooling layer; the last branch contains a residual connection to maintain the gradient during back propagation; the four branches undergo multi-scale feature fusion through a product operation; a residual connection is added to the outer layer of the multi-scale convolution to help the network converge rapidly and is combined with the multi-scale features through a weighted summation operation; taking the multi-scale temporal feature as input, pooling kernels of the two spatial dimensions (H, 1) and (1, W) encode each channel along the horizontal and vertical coordinates respectively to generate a pair of direction-aware feature maps, the process being defined as:

z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i) (7)

z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w) (8)

wherein z_c^h(h) and z_c^w(w) represent the output of the c-th channel at height h and at width w respectively, and x_c represents the feature tensor of the c-th channel;

the generated fusion feature maps are concatenated and transmitted to a shared 1×1 convolution transform function F_1, the process being defined as:

f = δ(F_1([z^h, z^w])) (9)

wherein [·,·] represents concatenation along the spatial dimension, δ represents a nonlinear activation function, and f represents the intermediate feature map encoding spatial information in the horizontal and vertical directions;

f is then divided along the spatial dimension into two independent tensors f^h and f^w, which are transformed through two 1×1 convolutions F_h and F_w into tensors with the same number of channels as the input tensor X, the process being defined as:

g^h = σ(F_h(f^h)) (10)

g^w = σ(F_w(f^w)) (11)

wherein σ represents the sigmoid activation function;

the outputs g^h and g^w are used as the attention weights, and the coordinate attention block finally outputs Y, the process being defined as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (12)
in the step S3:
firstly, carrying out average pooling operation on input features on a frame level and an articular level respectively, and carrying out local average pooling and local partial pooling on feature vectors subjected to time dimension pooling to obtain articular point data with different importance corresponding to double interaction actions, wherein the process is defined as follows:
f t =pool t (f in ) (13)
f v =pool v (f in ) (14)
f p =pool p (f t ) (15)
wherein pool represents pooling operation of corresponding dimension;
Then, the pooled space-time feature vectors are taken as input and pooled along the channel dimension, the three branch feature vectors are concatenated together, and the information is compressed through a fully connected layer; the process is defined as:

f_c = θ(W[pool_p(f_t) ⊙ pool_v(f_in) ⊙ pool_c(f_in)]) (16)

where ⊙ denotes the concatenation operation, θ denotes the HardSwish activation function, and W denotes a trainable parameter matrix;
The attention scores of the time-frame dimension, the joint dimension and the channel dimension are obtained by three independent fully connected layers, where σ denotes the sigmoid activation function and φ denotes the Swish activation function; finally the three scores are multiplied to obtain the spatio-temporal channel local attention map, which serves as the attention score of the whole action sequence.
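The final combination step can be sketched as follows; this is a minimal sketch in which the weight matrices W_t, W_v, W_c are hypothetical names for the three fully connected layers, and the assignment of σ to the frame and joint branches and φ to the channel branch is an assumption based on the symbols defined in the text:

```python
import math

# Minimal sketch of the spatio-temporal channel local attention map: three
# independent fully connected layers map the compressed feature f_c to
# frame-, joint- and channel-wise scores, which are broadcast-multiplied
# into a T x V x C attention map for the whole action sequence.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """phi in the text: x * sigmoid(x)."""
    return x * sigmoid(x)

def fc(vec, weights):
    """One fully connected layer without bias: weights is rows x len(vec)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def st_channel_attention(f_c, W_t, W_v, W_c):
    a_t = [sigmoid(s) for s in fc(f_c, W_t)]  # frame scores, length T
    a_v = [sigmoid(s) for s in fc(f_c, W_v)]  # joint scores, length V
    a_c = [swish(s) for s in fc(f_c, W_c)]    # channel scores, length C
    # outer product of the three score vectors -> T x V x C attention map
    return [[[t * v * c for c in a_c] for v in a_v] for t in a_t]
```

Multiplying the three one-dimensional score vectors lets every (frame, joint, channel) cell receive a joint weight without materializing a dense T × V × C parameter tensor.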
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741806.5A CN116665308B (en) | 2023-06-21 | 2023-06-21 | Double interaction space-time feature extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665308A CN116665308A (en) | 2023-08-29 |
CN116665308B true CN116665308B (en) | 2024-01-23 |
Family
ID=87727903
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104011723A (en) * | 2011-12-15 | 2014-08-27 | Micron Technology, Inc. | Boolean logic in a state machine lattice |
CN111680606A (en) * | 2020-06-03 | 2020-09-18 | Hydrology Bureau of the Huaihe River Commission (Information Center) | Low-power-consumption water level remote measuring system based on artificial intelligence cloud identification water gauge |
CN111950540A (en) * | 2020-07-24 | 2020-11-17 | Zhejiang Normal University | Knowledge point extraction method, system, device and medium based on deep learning |
CN112560712A (en) * | 2020-12-18 | 2021-03-26 | Xidian University | Behavior identification method, device and medium based on time-enhanced graph convolutional network |
CN112906545A (en) * | 2021-02-07 | 2021-06-04 | Institute of Intelligent Manufacturing, Guangdong Academy of Sciences | Real-time action recognition method and system for multi-person scene |
CN113657349A (en) * | 2021-09-01 | 2021-11-16 | Chongqing University of Posts and Telecommunications | Human body behavior identification method based on multi-scale space-time graph convolutional neural network |
CN114694174A (en) * | 2022-03-02 | 2022-07-01 | Beijing University of Posts and Telecommunications | Human body interaction behavior identification method based on space-time graph convolution |
CN114882421A (en) * | 2022-06-01 | 2022-08-09 | Jiangnan University | Method for recognizing skeleton behavior based on space-time feature enhancement graph convolutional network |
WO2023024438A1 (en) * | 2021-08-24 | 2023-03-02 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Behavior recognition method and apparatus, electronic device, and storage medium |
CN115841697A (en) * | 2022-09-19 | 2023-03-24 | Shanghai University | Motion recognition method based on skeleton and image data fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||