CN114648722B - Motion recognition method based on video multipath space-time characteristic network - Google Patents

Motion recognition method based on video multipath space-time characteristic network

Info

Publication number
CN114648722B
Authority
CN
China
Prior art keywords
feature
pooling
video
space
layer
Prior art date
Legal status
Active
Application number
CN202210362715.6A
Other languages
Chinese (zh)
Other versions
CN114648722A (en)
Inventor
张海平
胡泽鹏
刘旭
马琮皓
管力明
施月玲
Current Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, School of Information Engineering of Hangzhou Dianzi University
Priority to CN202210362715.6A
Publication of CN114648722A
Application granted
Publication of CN114648722B
Legal status: Active


Classifications

    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Neural networks; learning methods


Abstract

The invention discloses an action recognition method based on a video multipath space-time characteristic network, which comprises the following steps: acquiring a video to be identified, extracting a plurality of images from the video according to the frame rate, and preprocessing the images; extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences; establishing a space-time feature network model that comprises a plurality of feature extraction modules, and inputting the image sequences to the feature extraction modules in one-to-one correspondence to obtain space-time feature matrices; aggregating the space-time feature matrices output by the feature extraction modules to output a feature vector; and classifying the feature vector with a classifier, taking the class with the highest probability as the detection result. The method greatly improves the accuracy of action video classification, helps the network model understand action videos better, and markedly improves robustness, so that complex scenes in real life can be handled.

Description

Motion recognition method based on video multipath space-time characteristic network
Technical Field
The invention belongs to the field of deep learning video understanding, and particularly relates to an action recognition method based on a video multipath space-time feature network.
Background
The rapid growth of the video market has been driven by technical innovations in the mobile internet and intelligent digital devices. Today, smart mobile devices can store thousands of videos, and mobile applications let users conveniently access hundreds of video websites through the mobile internet. Video is therefore becoming increasingly important in many areas. For example, action recognition can be applied to the review of the large numbers of videos uploaded to websites every day, to video surveillance of dangerous actions and behaviors, and even to fields such as robot motion technology. However, conventional deep learning methods often suffer from low accuracy and slow speed, especially when dealing with large numbers of videos and complex action video scenes.
In current artificial intelligence deep learning methods, action classification is typically achieved through one of two mechanisms. The first is the two-stream network, in which one stream operates on RGB frames to extract spatial information while the other takes optical flow as input to capture temporal information. Adding the optical flow branch greatly improves recognition accuracy, but computing optical flow is quite expensive. The second is to learn spatio-temporal features from multi-frame RGB images with 3D convolutions. A 3D CNN can extract spatio-temporal information effectively, but because spatial and temporal information are extracted jointly, this type of network lacks specific treatment of the time dimension and, unlike a two-stream network, cannot obtain explicit frame-to-frame motion differences from optical flow, so much important information is lost during feature extraction. It therefore remains a challenge to better separate temporal from spatial information in a 3D CNN so that each expresses its own characteristic information more explicitly, in particular when extracting spatial and temporal information from video clips. Spatial information represents the static content of a single-frame scene, such as the action entities and the specific form of the action in the video; temporal information represents the integration of spatial information over multiple frames to obtain motion context. It is therefore necessary to design an effective deep learning method for these two parts to improve the accuracy of action recognition.
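As a toy illustration of the second mechanism (not taken from the patent; the layer shapes and hyper-parameters below are arbitrary), a single 3D convolution convolves jointly over the temporal and spatial axes of a clip, which is exactly why the temporal and spatial cues become entangled:

import torch
import torch.nn as nn

# One Conv3d layer applied to a 16-frame RGB clip: time and space are convolved together.
clip = torch.randn(1, 3, 16, 224, 224)  # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                 # (1, 64, 16, 112, 112): joint spatio-temporal features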
Disclosure of Invention
The invention aims to solve the above problems and provides an action recognition method based on a video multipath space-time characteristic network, which can greatly improve the accuracy of action video classification, helps enhance the network model's understanding of action videos, and markedly improves robustness, so that the method can cope with complex scenes in real life.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a motion recognition method based on a video multipath space-time characteristic network, which comprises the following steps:
S1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images;
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences;
S3, establishing a space-time feature network model, wherein the space-time feature network model comprises a plurality of feature extraction modules, each image sequence is input to the feature extraction modules in a one-to-one correspondence mode, and the feature extraction modules execute the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix;
S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors;
and S5, classifying and detecting the feature vector by using a classifier, and taking the class with the highest probability as a detection result.
Preferably, in step S1, the preprocessing is to randomly crop the image to a width and height of [256,320] pixels.
Preferably, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
Preferably, in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through a convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
Preferably, the first multi-layer perceptron has a reduction coefficient r and an amplification coefficient 2r, and the second multi-layer perceptron has a reduction coefficient r and an amplification coefficient r, where r=16.
Preferably, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1:1.
Compared with the prior art, the invention has the following beneficial effects:
According to the method, the acquired video to be identified is decomposed into image frames, and a plurality of image sequences acquired at different sampling rates serve as multi-level inputs of the space-time feature network model, so that temporal modeling is performed naturally on the acquired image sequences. A difference operation on the intermediate features extracted from the corresponding image sequences greatly reduces the interference of the video background on recognition accuracy without increasing the amount of computation. The average pooling features and maximum pooling features effectively aggregate the information of the action that is sensitive in the time dimension and model the whole video globally; in this process the robustness of the space-time feature network model is continuously strengthened, so that when the pooling information maps are aggregated, the space-time feature matrix output by each feature extraction module represents the features extracted at that level, which greatly improves the accuracy of action video classification. Moreover, fusing multiple space-time feature matrices helps enhance the network model's understanding of the action video and markedly improves robustness, so that complex scenes in real life can be handled.
Drawings
FIG. 1 is a flow chart of the motion recognition method of the present invention;
FIG. 2 is a general architecture diagram of the motion recognition method of the present invention;
FIG. 3 is a schematic structural diagram of the spatial difference module according to the present invention;
FIG. 4 is a schematic structural diagram of the temporal attention module according to the present invention.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It is noted that unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-4, a motion recognition method based on a video multipath space-time feature network includes the following steps:
s1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images. The number of images extracted from Video (Video) (total number of frames of Video) is the number of frames per second of Video (frame rate) times the total number of seconds of Video.
In one embodiment, in step S1, the preprocessing is to randomly crop the image to a width and height of [256,320] pixels.
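A minimal sketch of step S1 is given below, assuming OpenCV (cv2) is used for decoding; the function name extract_and_preprocess and the clamping of the crop size to the frame size are illustrative choices, not specified in the patent.

import random
import cv2  # assumed dependency for video decoding

def extract_and_preprocess(video_path):
    # Decode every frame (frame rate x duration frames in total) and randomly crop
    # each frame so that its width and height fall within [256, 320] pixels.
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        ch = min(random.randint(256, 320), h)   # crop height in [256, 320]
        cw = min(random.randint(256, 320), w)   # crop width in [256, 320]
        top = random.randint(0, h - ch)
        left = random.randint(0, w - cw)
        frames.append(frame[top:top + ch, left:left + cw])
    cap.release()
    return frames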
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences.
Specifically, m sampling rates [τ1, τ2, ..., τm] are set, and the images extracted at each sampling rate form one image sequence (Sample); the dimensions of the m image sequences are determined by the corresponding sampling rates and are expressed in terms of T, C, H and W, wherein T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image.
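A minimal sketch of step S2, assuming the sequences are formed by taking every τ-th preprocessed frame; the example rates (2, 8) are placeholders, since the patent does not fix the values of τ1, ..., τm.

def build_image_sequences(frames, sampling_rates=(2, 8)):  # rates are an assumption
    # Each sampling rate tau produces one image sequence containing every tau-th frame,
    # so different sequences contain different numbers of images.
    return [frames[::tau] for tau in sampling_rates]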
S3, establishing a space-time feature network model, wherein the space-time feature network model comprises a plurality of feature extraction modules, each image sequence is input to the feature extraction modules in a one-to-one correspondence mode, and the feature extraction modules execute the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix.
In one embodiment, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
In one embodiment, in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
In an embodiment, the first multi-layer perceptron has a reduction coefficient r and an amplification coefficient 2r, and the second multi-layer perceptron has a reduction coefficient r and an amplification coefficient r, where r=16.
Each feature extraction module includes, as shown in FIG. 2, a Backbone framework (for example, a ResNet framework), a spatial difference module (Spatial-Difference Modulation) and a temporal attention module (Temporal-Attention Modulation) connected in sequence. The i-th image sequence is denoted as F_i, and the elements of {F_1, F_2, ..., F_m} are fed in one-to-one correspondence as the inputs of the m feature extraction modules; the intermediate feature X of the corresponding image sequence is obtained through the Backbone framework. In this embodiment, m is set to 2 and N = 32, and the batch size (batch_size) can be adjusted according to actual requirements. The details are as follows:
as shown in fig. 3, the spatial difference module comprises a first extraction unit (Difference operation), a maximum pooling layer (MaxPooling), a first multi-layer perceptron (MLP) and a SIGMOID layer (SIGMOID), and a second extraction unit (including point multiplication and addition operations), wherein the first extraction unit is used to uniformly divide the intermediate feature X into a first feature matrix X 0 And a second feature matrix X 1 And calculate the difference X 1 -X 0 By extracting the difference characteristics through the subtraction operation, the interference of the motion recognition video background on the motion recognition accuracy can be greatly reduced under the condition of not increasing the computational complexity. The difference features sequentially pass through a maximum pooling layer, a first multi-layer perceptron and sigThe moid layer outputs spatial attention characteristics, and front-back characteristic differences are effectively extracted through the 3D maximum pooling layer to obtain F max ∈R N×(T/2r)×1×1×1 . And then F is arranged max ∈R N×(T/2r)×1×1×1 Through a first multi-layer perceptron, wherein the first multi-layer perceptron comprises a first 3D convolution layer, a ReLU layer and a second 3D convolution layer which are sequentially connected, in order to reduce parameter overhead and improve feature extraction effect, the first multi-layer perceptron uses F max ∈R N×(T/2r)×1×1×1 Firstly, shrinking and then amplifying, wherein the reduction coefficient is r, and the amplification coefficient is 2r, if r=16, F is obtained mlp ∈R N×T×1×1×1 . Will F mlp Input to the sigmoid layer obtains corresponding spatial attention features. The second extraction unit multiplies the spatial attention feature by the intermediate feature X point and then adds the spatial attention feature to the intermediate feature X to obtain a spatial feature map (Spatial Attention), wherein the spatial feature map has the following calculation formula:
Y=X+X·δ(MLP(Max(D(X))))
where X is the intermediate feature of the backbond frame output and D is the differential operation (i.e., X 1 -X 0 ) Max is the Max pooling operation, MLP is the first multi-layer perceptron operation, delta is the sigmoid operation. The above is a specific structure and operation of one spatial differential module, and other spatial differential modules are similar, only correspond to different outputs, and the convolution kernel size is different, which is not described herein.
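The following PyTorch sketch shows one possible reading of the spatial difference module (Y = X + X · δ(MLP(Max(D(X))))). Treating the temporal dimension as the channel dimension of 1×1×1 3D convolutions, pooling only over (C, H, W), and requiring T to be a multiple of 2r are assumptions made here for concreteness; the patent text only fixes the reduction coefficient r and the amplification coefficient 2r, and its quoted F_max dimension may differ from the intermediate shape used below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDifferenceModule(nn.Module):
    def __init__(self, num_frames, r=16):        # assumes num_frames is a multiple of 2r
        super().__init__()
        # first multi-layer perceptron: shrink the temporal channels by r, then amplify by 2r
        self.mlp = nn.Sequential(
            nn.Conv3d(num_frames // 2, num_frames // (2 * r), kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(num_frames // (2 * r), num_frames, kernel_size=1),
        )

    def forward(self, x):                          # x: intermediate feature (N, T, C, H, W)
        t = x.size(1)
        x0, x1 = x[:, : t // 2], x[:, t // 2:]     # first half / latter half along T
        diff = x1 - x0                             # difference feature, (N, T/2, C, H, W)
        f_max = F.adaptive_max_pool3d(diff, 1)     # global max pooling over (C, H, W)
        att = torch.sigmoid(self.mlp(f_max))       # spatial attention, (N, T, 1, 1, 1)
        return x + x * att                         # spatial feature map Y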
As shown in FIG. 4, the temporal attention module includes a parallel maximum pooling layer (MaxPooling) and average pooling layer (AvgPooling), a second multi-layer perceptron (Shared-MLP), a squeeze operation, a concat operation (C), a 1D convolution layer (1DCNN), an unsqueeze operation, three sigmoid layers (Sigmoid) and a third extraction unit (comprising addition, element-wise multiplication and addition). The spatial feature map is processed by a 3D maximum pooling layer to obtain the first maximum pooling feature map and by a 3D average pooling layer to obtain the first average pooling feature map. The first maximum pooling feature map and the first average pooling feature map are each processed by the second multi-layer perceptron to obtain the second maximum pooling feature map and the second average pooling feature map, respectively; the second multi-layer perceptron is similar in structure to the first, but both its reduction coefficient and amplification coefficient are r, with r = 16. The second maximum pooling feature map and the second average pooling feature map then pass through the squeeze operation, concat operation, 1D convolution layer and unsqueeze operation in sequence to obtain the fusion feature map. Specifically, the squeeze operation is applied to the second maximum pooling feature map and the second average pooling feature map to obtain F'_max and F'_avg, both of dimension R^(N×T×1); F'_max and F'_avg are concatenated along the second dimension by the concat operation to obtain F_ios ∈ R^(N×2T×1); F_ios is then passed through a 1D convolution layer with convolution kernel size (3, 3) to further strengthen the link between the average and maximum features; finally, the unsqueeze operation restores the original dimensions to obtain the fusion feature map. The second maximum pooling feature map, the second average pooling feature map and the fusion feature map are passed through the three sigmoid layers in one-to-one correspondence to obtain the first pooling information map F_temp1 ∈ R^(N×T×1×1×1), the second pooling information map F_temp2 ∈ R^(N×T×1×1×1) and the third pooling information map F_temp3 ∈ R^(N×T×1×1×1), respectively. The first pooling information map, the second pooling information map and the third pooling information map are added to form the fourth pooling information map, which is element-wise multiplied with the spatial feature map; the result is added to the spatial feature map, and the space-time feature matrix is output. The pooling information maps and the space-time feature matrix are computed as:

F_temp1 = δ(SMLP(Max(X')))
F_temp2 = δ(SMLP(Avg(X')))
F_temp3 = δ(unsqueeze(Conv(squeeze([Avg(X'), Max(X')]))))
Y' = X' + X' · (F_temp1 + F_temp2 + F_temp3)

where Y' is the space-time feature matrix, X' is the spatial feature map, δ is the sigmoid operation, SMLP is the second multi-layer perceptron operation, Max is the maximum pooling operation, Avg is the average pooling operation, Conv is the convolution operation, squeeze is the squeeze operation, and unsqueeze is the unsqueeze operation.
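The following PyTorch sketch shows one possible reading of the temporal attention module, following the step-by-step description above (the shared-MLP outputs feed the squeeze, concat and 1D-convolution branch). The 1×1×1 convolutions inside the shared MLP and the padding of the 1D convolution are assumptions, and T is assumed to be a multiple of r.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionModule(nn.Module):
    def __init__(self, num_frames, r=16):
        super().__init__()
        # second (shared) multi-layer perceptron: reduction coefficient r, amplification coefficient r
        self.shared_mlp = nn.Sequential(
            nn.Conv3d(num_frames, num_frames // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(num_frames // r, num_frames, kernel_size=1),
        )
        # 1D convolution over the concatenated (squeezed) pooling features
        self.conv1d = nn.Conv1d(2 * num_frames, num_frames, kernel_size=3, padding=1)

    def forward(self, x):                                      # x: spatial feature map (N, T, C, H, W)
        f_max = self.shared_mlp(F.adaptive_max_pool3d(x, 1))   # second max pooling feature map
        f_avg = self.shared_mlp(F.adaptive_avg_pool3d(x, 1))   # second average pooling feature map
        # squeeze -> concat along the second dimension -> 1D convolution -> unsqueeze
        fused = torch.cat([f_avg.flatten(2), f_max.flatten(2)], dim=1)  # (N, 2T, 1)
        fused = self.conv1d(fused).unsqueeze(-1).unsqueeze(-1)          # fusion feature map, (N, T, 1, 1, 1)
        att = torch.sigmoid(f_max) + torch.sigmoid(f_avg) + torch.sigmoid(fused)
        return x + x * att                                     # space-time feature matrix Y'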
The temporal attention module effectively aggregates the average pooling feature and the maximum pooling feature to extract the information of the action that is sensitive in the time dimension. It significantly alleviates the insensitivity of prior-art networks to the important features of specific action behaviors in temporal modeling; for example, in a shooting video, more attention should be paid to the positions of the ball and the hands over time, rather than to the athlete's body as prior-art networks mistakenly do. It should be noted that each temporal attention module performs similar operations, which are not described again here.
And S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors.
In one embodiment, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1:1.
The space-time feature matrices output by the feature extraction modules have the same dimensions, and an element-wise addition (i.e., aggregation, Fusion) is performed on them in alignment, preferably with a weight ratio of 1:1. Fusing multiple space-time feature matrices helps enhance the network's understanding of the action video and markedly improves robustness, so that complex scenes in real life can be handled.
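A minimal sketch of step S4 under the stated 1:1 weighting; how the fused matrix is reduced to a feature vector is not spelled out in the patent, so flattening is used here purely as a placeholder.

import torch

def aggregate(spatiotemporal_matrices):
    # Element-wise sum with a 1:1 weight ratio (all matrices share the same dimensions),
    # then flatten each sample into a feature vector.
    fused = torch.stack(spatiotemporal_matrices, dim=0).sum(dim=0)
    return fused.flatten(1)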
And S5, classifying and detecting the feature vector by using a classifier, and taking the class with the highest probability as the detection result. The classifier adopts a fully connected (linear) layer of the neural network, outputs the probability that the video to be recognized belongs to each category, and takes the category with the highest probability as the action recognition result.
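A minimal sketch of step S5; the number of classes and the feature dimension below are illustrative values, not taken from the patent.

import torch
import torch.nn as nn

num_classes, feature_dim = 400, 2048            # assumed values for illustration
classifier = nn.Linear(feature_dim, num_classes)

def predict(feature_vector):                    # feature_vector: (N, feature_dim)
    probs = torch.softmax(classifier(feature_vector), dim=1)  # probability of each action class
    return probs.argmax(dim=1)                  # class with the highest probability = detection result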
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this description.
The above embodiments merely represent several specific and detailed implementations of the present application and are not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (6)

1. A motion recognition method based on a video multipath space-time characteristic network, characterized in that the method comprises the following steps:
S1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images;
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences;
S3, establishing a space-time characteristic network model, wherein the space-time characteristic network model comprises a plurality of characteristic extraction modules, each image sequence is input to the characteristic extraction module in a one-to-one correspondence manner, and the characteristic extraction module performs the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
the calculation formula of the spatial feature map is as follows:
Y = X + X · δ(MLP(Max(D(X))))
wherein D(X) = X1 - X0, Max is the maximum pooling operation, MLP is the first multi-layer perceptron operation, and δ is the sigmoid operation;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix;
S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors;
and S5, classifying and detecting the feature vector by using a classifier, wherein the class with the highest probability is used as a detection result.
2. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S1, the preprocessing is to randomly crop the image to a width and height of [256, 320] pixels.
3. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
4. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through a convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
5. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: the reduction coefficient and the amplification coefficient of the first multi-layer perceptron are r and 2r respectively, and the reduction coefficient and the amplification coefficient of the second multi-layer perceptron are both r, where r = 16.
6. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, a weight ratio of each spatio-temporal feature matrix is 1:1.
CN202210362715.6A 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network Active CN114648722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Publications (2)

Publication Number Publication Date
CN114648722A CN114648722A (en) 2022-06-21
CN114648722B 2023-07-18

Family

ID=81997696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210362715.6A Active CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Country Status (1)

Country Link
CN (1) CN114648722B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160297B (en) * 2019-12-31 2022-05-13 武汉大学 Pedestrian re-identification method and device based on residual attention mechanism space-time combined model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN111123257A (en) * 2019-12-30 2020-05-08 西安电子科技大学 Radar moving target multi-frame joint detection method based on graph space-time network
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113408343A (en) * 2021-05-12 2021-09-17 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Wei; "Fast inter-frame mode decision algorithm for HEVC based on spatio-temporal correlation" (《基于时空相关性的HEVC帧间模式决策快速算法》); Journal on Communications (《通信学报》); Vol. 37, No. 4; pp. 64-73 *

Also Published As

Publication number Publication date
CN114648722A (en) 2022-06-21

Similar Documents

Publication Publication Date Title
Liu et al. Robust video super-resolution with learned temporal dynamics
Kim et al. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment
Linardos et al. Simple vs complex temporal recurrences for video saliency prediction
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
Cheng et al. Memory-efficient network for large-scale video compressive sensing
CN109993269B (en) Single image crowd counting method based on attention mechanism
CN112465727A (en) Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory
CN112257526B (en) Action recognition method based on feature interactive learning and terminal equipment
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112446342A (en) Key frame recognition model training method, recognition method and device
CN114463218B (en) Video deblurring method based on event data driving
CN113255616B (en) Video behavior identification method based on deep learning
CN114648722B (en) Motion recognition method based on video multipath space-time characteristic network
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN112446348A (en) Behavior identification method based on characteristic spectrum flow
CN111310516B (en) Behavior recognition method and device
US20190379845A1 (en) Code division compression for array cameras
CN112528077B (en) Video face retrieval method and system based on video embedding
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
Sun et al. Video snapshot compressive imaging using residual ensemble network
CN108596831B (en) Super-resolution reconstruction method based on AdaBoost example regression
CN116524596A (en) Sports video action recognition method based on action granularity grouping structure
Li et al. REQA: Coarse-to-fine assessment of image quality to alleviate the range effect
CN112446245A (en) Efficient motion characterization method and device based on small displacement of motion boundary

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant