CN114648722A - Action identification method based on video multipath space-time characteristic network - Google Patents

Action identification method based on video multipath space-time characteristic network

Info

Publication number
CN114648722A
Authority
CN
China
Prior art keywords
feature
map
pooling
video
feature map
Prior art date
Legal status
Granted
Application number
CN202210362715.6A
Other languages
Chinese (zh)
Other versions
CN114648722B (en)
Inventor
张海平
胡泽鹏
刘旭
马琮皓
管力明
施月玲
Current Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, School of Information Engineering of Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210362715.6A priority Critical patent/CN114648722B/en
Publication of CN114648722A publication Critical patent/CN114648722A/en
Application granted granted Critical
Publication of CN114648722B publication Critical patent/CN114648722B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method based on a video multipath spatio-temporal feature network, which comprises the following steps: acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images; extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences; establishing a spatio-temporal feature network model comprising a plurality of feature extraction modules, and feeding the image sequences to the feature extraction modules in one-to-one correspondence to obtain spatio-temporal feature matrices; aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector; and classifying the feature vector with a classifier, taking the class with the highest probability as the detection result. The method greatly improves the accuracy of action video classification, helps the network model to better understand action videos, and significantly improves robustness, so that it can cope with complex scenes in real life.

Description

Action identification method based on video multipath space-time characteristic network
Technical Field
The invention belongs to the field of deep learning video understanding, and particularly relates to an action identification method based on a video multipath spatiotemporal feature network.
Background
The rapid growth of the video market benefits from technological innovations in the mobile internet, intelligent digital devices, and related fields. Today, smart mobile devices can store thousands of videos, and mobile applications allow users to conveniently access hundreds of video websites over the mobile internet. Video is therefore becoming increasingly important in many areas. For example, action recognition can be applied to the review of the large numbers of videos uploaded to websites every day, to video-based monitoring of dangerous actions and behaviors, and even to fields such as robot motion technology. However, conventional deep learning methods generally suffer from low accuracy and slow speed, and are particularly unsatisfactory when processing large numbers of videos or complex action scenes.
In current deep learning methods, action classification is typically implemented by one of two mechanisms. The first is the two-stream network, in which one stream operates on RGB frames to extract spatial information, while the other takes optical flow as input to capture temporal information. Adding the optical-flow branch greatly improves recognition accuracy, but computing optical flow is very expensive. The second is to learn spatio-temporal features from multiple RGB frames with 3D convolutions. A 3D CNN can extract spatio-temporal information effectively, but because temporal and spatial information are extracted jointly, this type of network lacks explicit treatment of the time dimension; unlike a two-stream network, it cannot obtain the specific differences between successive actions from optical-flow information, and much important information is lost during feature extraction. How to better separate temporal and spatial information in a 3D CNN so that each can express its own characteristics more clearly therefore remains a challenge. The challenge also lies in how spatial and temporal information are extracted from video segments: spatial information represents static information in a single-frame scene, such as the action entities in the video and the specific form of the action, while temporal information represents the integration of spatial information over multiple frames to obtain context about the action. It is therefore necessary to design an effective deep learning method for these two parts to improve the accuracy of action recognition.
Disclosure of Invention
The invention aims to provide an action recognition method based on a video multipath spatio-temporal feature network, which can greatly improve the accuracy of action video classification, helps the network model to better understand action videos, and significantly improves robustness, so that the method can cope with complex scenes in real life.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention provides a motion identification method based on a video multipath space-time characteristic network, which comprises the following steps:
S1, acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images;
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences;
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix;
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector;
S5, classifying the feature vector with a classifier, and taking the class with the highest probability as the detection result.
Preferably, in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
Preferably, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
Preferably, in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
Preferably, the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, both the reduction coefficient and the amplification coefficient of the second multilayer perceptron are r, and r is 16.
Preferably, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1: 1.
compared with the prior art, the invention has the beneficial effects that:
the method is characterized in that an acquired video to be identified is framed as an image, a plurality of image sequences are acquired at different sampling rates and used as multi-level input of a space-time feature network model, the acquired image sequences are subjected to time sequence modeling naturally, intermediate features extracted from corresponding image sequences are subjected to difference operation, the interference of a video background on action identification accuracy can be greatly reduced on the premise of not increasing calculated amount, sensitive information of actions in a time dimension can be effectively extracted by aggregating average pooling features and maximum pooling features, the whole video is subjected to global modeling, the robustness of the space-time feature network model can be continuously enhanced in the process, and therefore when each pooling information graph is aggregated, a space-time feature matrix output by each layer of feature extraction module can represent the extracted characteristics of one layer, the accuracy of motion video classification can be greatly improved; and by fusing a plurality of space-time characteristic matrixes, the understanding of the network model to the action video can be enhanced, the robustness is obviously improved, and the complex scene in real life can be dealt with.
Drawings
FIG. 1 is a flow chart of a motion recognition method of the present invention;
FIG. 2 is a general architecture diagram of the motion recognition method of the present invention;
FIG. 3 is a schematic diagram of a spatial difference module according to the present invention;
FIG. 4 is a schematic structural diagram of an attention timing module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-4, a method for identifying an action based on a video multipath spatiotemporal feature network includes the following steps:
S1, acquiring the video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images. The number of images extracted from the video (the total number of video frames) equals the number of frames per second (the frame rate) multiplied by the total duration of the video in seconds.
In one embodiment, in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
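For illustration only, a minimal sketch of step S1 is given below, assuming OpenCV decoding, frames of at least 256 pixels per side, and function and variable names (extract_and_preprocess, crop_range) that are illustrative rather than taken from the patent:

```python
import random

import cv2


def extract_and_preprocess(video_path, crop_range=(256, 320)):
    """Decode all frames of a video and randomly crop each one (step S1).

    The total number of extracted frames equals frame rate x duration,
    i.e. every decoded frame is kept. Assumes each frame is at least
    crop_range[0] pixels in both height and width.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        # Randomly choose a crop height and width within [256, 320] pixels.
        ch = random.randint(crop_range[0], min(crop_range[1], h))
        cw = random.randint(crop_range[0], min(crop_range[1], w))
        top = random.randint(0, h - ch)
        left = random.randint(0, w - cw)
        frames.append(frame[top:top + ch, left:left + cw])
    cap.release()
    return frames  # list of (crop_h, crop_w, 3) uint8 arrays
```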
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences.
Wherein m sampling rates are set as [τ1, τ2, ..., τm], and the images extracted at each sampling rate form an image sequence (Sample). The dimensions of the m image sequences are respectively R^((T/τi)×C×H×W) for i = 1, ..., m, wherein T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width.
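A possible realization of this multi-rate sampling (step S2) is sketched below; the strided-indexing convention and the example sampling rates are assumptions for illustration only.

```python
import torch


def build_image_sequences(frames: torch.Tensor, sampling_rates):
    """Form one image sequence per sampling rate (step S2).

    `frames` holds all preprocessed frames as a (T, C, H, W) tensor;
    sequence i keeps every tau_i-th frame, so its length is about
    T / tau_i. The exact indexing convention is an assumption.
    """
    sequences = []
    for tau in sampling_rates:
        idx = torch.arange(0, frames.shape[0], step=tau)
        sequences.append(frames[idx])  # shape: (T / tau, C, H, W)
    return sequences


# Example with m = 2 paths; the sampling rates themselves are illustrative.
# seq_fast, seq_slow = build_image_sequences(frames, sampling_rates=[2, 8])
```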
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix.
In one embodiment, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
In one embodiment, in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
In one embodiment, the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, both the reduction coefficient and the amplification coefficient of the second multilayer perceptron are r, and r is 16.
As shown in fig. 2, each feature extraction module includes a Backbone framework, a spatial difference module (Spatial-Difference Modulation) and an attention timing module (Temporal-Attention Modulation) connected in sequence, the Backbone adopting an existing network framework as an example. Let the i-th image sequence be Fi, and take {F1, F2, ..., Fm} as the inputs of the m feature extraction modules in one-to-one correspondence; the intermediate feature X of the corresponding image sequence is obtained through the Backbone. In this embodiment, m is set to 2 and N to 32, and the batch size (Batch_Size) can be adjusted according to actual requirements. The details are as follows:
as shown in fig. 3, the spatial Difference module includes a first extraction unit (Difference operation), a max pooling layer (MaxPooling), a first multi-layer perceptron (MLP) and a SIGMOID layer (SIGMOID), and a second extraction unit (including dot multiplication and addition operations), and the first extraction unit is used to divide the intermediate features X into a first feature matrix X0And a second feature matrix X1And calculating a difference X1-X0And the difference features are extracted through subtraction operation, so that the interference of the motion recognition video background on the motion recognition accuracy can be greatly reduced under the condition of not increasing the calculation complexity. The difference characteristics sequentially pass through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output spatial attention characteristics, and the difference of the front and rear characteristics is effectively extracted through the 3D maximum pooling layer to obtain Fmax∈RN×(T/2r)×1×1×1. Then F is mixedmax∈RN×(T/2r)×1×1×1Through the first multilayer perceptron, wherein the first multilayer perceptron comprises a first 3D convolution layer, a ReLU layer and a second 3D convolution layer which are connected in sequence, in order to reduce parameter overhead and improve feature extraction effect, the first multilayer perceptron is used for extracting Fmax∈RN×(T/2r)×1×1×1Reducing and then amplifying, wherein the reduction coefficient is r, the amplification coefficient is 2r, and if r is 16, F is obtainedmlp∈RN×T×1×1×1. F is to bemlpInputting the data into a sigmoid layer to obtain corresponding spatial attention characteristics. The Spatial Attention feature is multiplied by the intermediate feature X point by using a second extraction unit and then is added with the intermediate feature X to obtain a Spatial feature map (Spatial attribute), and a calculation formula of the Spatial feature map is as follows:
Y=X+X·(δ(MLP(Max(D(X))))
wherein X is an intermediate characteristic of the Backbone frame output, and D is a differential operation (i.e. X)1-X0) Max is the maximum pooling operation, MLP is the first multi-tier perceptron operation, and δ is the sigmoid operation. The above is the specific structure and operation of one spatial difference module, and other spatial difference modules are the same, and only correspond to different outputs, and the sizes of convolution kernels are different, which is not described herein again。
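The spatial difference module could be sketched in PyTorch as follows; the use of 1×1×1 3D convolutions for the first multilayer perceptron and global max pooling over C, H and W are assumptions consistent with the stated dimensions, not implementation details given in the patent.

```python
import torch
import torch.nn as nn


class SpatialDifferenceModule(nn.Module):
    """Sketch of the spatial difference module (fig. 3).

    Assumes X has shape (N, T, C, H, W); T and the reduction ratio r must
    be known when the module is built. Layer choices (1x1x1 3D convolutions
    for the MLP, global max pooling over C, H, W) are illustrative guesses.
    """

    def __init__(self, num_frames: int, r: int = 16):
        super().__init__()
        half_t = num_frames // 2
        hidden = max(half_t // r, 1)
        # First MLP: reduce the temporal channels by r, then amplify by 2r,
        # mapping (N, T/2, 1, 1, 1) -> (N, T/(2r), 1, 1, 1) -> (N, T, 1, 1, 1).
        self.mlp = nn.Sequential(
            nn.Conv3d(half_t, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, num_frames, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.shape[1]
        x0, x1 = x[:, : t // 2], x[:, t // 2:]               # split along time
        diff = x1 - x0                                        # difference feature D(X)
        f_max = torch.amax(diff, dim=(2, 3, 4), keepdim=True) # (N, T/2, 1, 1, 1)
        attn = torch.sigmoid(self.mlp(f_max))                 # (N, T, 1, 1, 1)
        return x + x * attn                                   # Y = X + X * sigmoid(MLP(Max(D(X))))
```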
As shown in fig. 4, the attention timing module includes a parallel max pooling layer (MaxPooling) and average pooling layer (AvgPooling), a second multilayer perceptron (Shared-MLP), a squeeze operation, a concat operation (C), a 1D convolution layer (1DCNN), an unsqueeze operation, three sigmoid layers (SIGMOID), and a third extraction unit (comprising addition, dot multiplication and addition). The spatial feature map is processed by a 3D max pooling layer to obtain the first maximum pooling feature map, and by a 3D average pooling layer to obtain the first average pooling feature map. The first maximum pooling feature map and the first average pooling feature map are each processed by the second multilayer perceptron to obtain the second maximum pooling feature map and the second average pooling feature map respectively; the second multilayer perceptron is similar in structure to the first, but its reduction coefficient and amplification coefficient are both r, with r = 16. The second maximum pooling feature map and the second average pooling feature map then pass sequentially through the squeeze operation, concat operation, 1D convolution layer and unsqueeze operation to obtain the fused feature map. Specifically, the second maximum pooling feature map and the second average pooling feature map are each squeezed to obtain F'max and F'avg, both of dimension R^(N×T×1). F'max and F'avg are concatenated along the second dimension through the concat operation to obtain Fios ∈ R^(N×2T×1). Fios then passes through a 1D convolution layer with convolution kernel size (3, 3), which further strengthens the relationship between the average and maximum features. Finally, the unsqueeze operation restores the original dimensions to obtain the fused feature map. The second maximum pooling feature map, the second average pooling feature map and the fused feature map are processed by the three sigmoid layers in one-to-one correspondence to obtain the first pooled information map Ftemp1 ∈ R^(N×T×1×1×1), the second pooled information map Ftemp2 ∈ R^(N×T×1×1×1) and the third pooled information map Ftemp3 ∈ R^(N×T×1×1×1) respectively. The first, second and third pooled information maps are added to form the fourth pooled information map, which is point-multiplied with the spatial feature map; the result is added to the spatial feature map, and the spatio-temporal feature matrix is output. The pooled information maps and the spatio-temporal feature matrix are calculated as follows:
Ftemp1 = δ(SMLP(Max(X')))
Ftemp2 = δ(SMLP(Avg(X')))
Ftemp3 = δ(unsqueeze(Conv(squeeze([Avg(X'), Max(X')]))))
Y' = X' + X'·(Ftemp1 + Ftemp2 + Ftemp3)
wherein Y' is the spatio-temporal feature matrix, X' is the spatial feature map, δ is the sigmoid operation, SMLP is the second multilayer perceptron operation, Max is the max pooling operation, Avg is the average pooling operation, Conv is the convolution operation, squeeze is the squeeze operation, and unsqueeze is the unsqueeze operation.
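A corresponding sketch of the attention timing module is given below; the exact way the 1D convolution maps the concatenated (N, 2T, 1) tensor back to T temporal positions is not fully specified above, so a kernel size of 3 with stride 2 and padding 1 is assumed here purely for illustration.

```python
import torch
import torch.nn as nn


class TemporalAttentionModule(nn.Module):
    """Sketch of the attention timing module (fig. 4).

    Assumes the spatial feature map X' has shape (N, T, C, H, W). The shared
    MLP built from 1x1x1 3D convolutions and the 1D-convolution settings
    (kernel 3, stride 2, padding 1) are assumptions for illustration.
    """

    def __init__(self, num_frames: int, r: int = 16):
        super().__init__()
        hidden = max(num_frames // r, 1)
        # Shared MLP: reduce by r, then amplify by r (both coefficients are r).
        self.shared_mlp = nn.Sequential(
            nn.Conv3d(num_frames, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, num_frames, kernel_size=1),
        )
        # Maps a length-2T signal back to length T (assumed interpretation).
        self.conv1d = nn.Conv1d(1, 1, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t = x.shape[:2]
        f_max = self.shared_mlp(torch.amax(x, dim=(2, 3, 4), keepdim=True))  # (N, T, 1, 1, 1)
        f_avg = self.shared_mlp(torch.mean(x, dim=(2, 3, 4), keepdim=True))  # (N, T, 1, 1, 1)
        # squeeze -> (N, T, 1), concat along dim 1 -> (N, 2T, 1)
        f_ios = torch.cat([f_avg.view(n, t, 1), f_max.view(n, t, 1)], dim=1)
        fused = self.conv1d(f_ios.view(n, 1, 2 * t)).view(n, t, 1, 1, 1)     # unsqueeze back
        attn = torch.sigmoid(f_max) + torch.sigmoid(f_avg) + torch.sigmoid(fused)
        return x + x * attn  # Y' = X' + X' * (Ftemp1 + Ftemp2 + Ftemp3)
```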
The attention timing module can effectively aggregate the average pooling features and the maximum pooling features to extract the information to which the action is sensitive in the time dimension. This significantly alleviates the problem that prior-art networks are insensitive to the important characteristics of specific actions during temporal modeling; for example, in a video of shooting a ball, more attention should be paid to how the positions of the ball and the hands change over time, rather than to the athlete's body, on which prior-art networks wrongly focus. It should be noted that the other attention timing modules perform similar operations, which are not described again here.
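Composing the Backbone with the two modules sketched above, one path of the multipath model might look as follows; the backbone is an arbitrary placeholder assumed to return features of shape (N, T, C, H, W).

```python
import torch.nn as nn


class FeatureExtractionPath(nn.Module):
    """One path of the multipath model: Backbone -> spatial difference module
    -> attention timing module. Reuses the two module sketches above; the
    backbone is any feature extractor returning (N, T, C, H, W) tensors."""

    def __init__(self, backbone: nn.Module, num_frames: int, r: int = 16):
        super().__init__()
        self.backbone = backbone
        self.spatial_diff = SpatialDifferenceModule(num_frames, r)
        self.temporal_attn = TemporalAttentionModule(num_frames, r)

    def forward(self, clip):
        x = self.backbone(clip)        # intermediate feature X
        y = self.spatial_diff(x)       # spatial feature map
        return self.temporal_attn(y)   # spatio-temporal feature matrix
```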
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector.
In one embodiment, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of the spatio-temporal feature matrices is 1:1.
The spatio-temporal feature matrices output by the feature extraction modules have the same dimensions, and the weight ratio is preferably 1:1, so that they can be added element-wise in alignment (i.e. aggregation, Fusion). Fusing multiple spatio-temporal feature matrices helps the network better understand the action video, significantly improving robustness so that complex scenes in real life can be handled.
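A minimal sketch of this 1:1 aggregation, assuming all paths output spatio-temporal feature matrices of identical shape; flattening the fused matrix into a feature vector is an illustrative choice, since the exact vectorization is not detailed above.

```python
import torch


def fuse_paths(feature_matrices, weights=None):
    """Aggregate the spatio-temporal feature matrices of all paths (step S4).

    All matrices share the same shape; with the preferred 1:1 weighting this
    is a plain element-wise sum. The final flattening into a per-clip feature
    vector is an assumption for illustration.
    """
    weights = weights or [1.0] * len(feature_matrices)
    fused = sum(w * f for w, f in zip(weights, feature_matrices))
    return fused.flatten(start_dim=1)  # (N, feature_dim) feature vector
```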
S5, classifying the feature vector with the classifier and taking the class with the highest probability as the detection result. The classifier adopts a linear (fully connected) layer of the neural network, outputs the probability that the video to be recognized belongs to each class, and takes the class with the highest probability as the action recognition result.
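The classification head of step S5 can be sketched as a single linear layer followed by a softmax; the feature dimension and number of classes are illustrative placeholders.

```python
import torch
import torch.nn as nn


class ActionClassifier(nn.Module):
    """Linear classification head (step S5): the class with the highest
    probability is taken as the recognition result."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, feature_vector: torch.Tensor):
        logits = self.fc(feature_vector)
        probs = torch.softmax(logits, dim=1)  # probability of each class
        return probs.argmax(dim=1), probs     # predicted class, class probabilities
```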
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described embodiments only express several specific and detailed implementations of the present application and are not to be construed as limiting the scope of the claims. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. An action recognition method based on a video multipath spatio-temporal feature network, characterized in that the action recognition method based on the video multipath spatio-temporal feature network comprises the following steps:
S1, acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images;
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences;
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix;
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector;
S5, classifying the feature vector with a classifier, and taking the class with the highest probability as the detection result.
2. The method of claim 1, wherein the method comprises: in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
3. The method of claim 1, wherein the method comprises: in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
4. The method of claim 1, wherein the method comprises: in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
5. The method of claim 1, wherein the method comprises: the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, the reduction coefficient and the amplification coefficient of the second multilayer perceptron are both r, and r is 16.
6. The method of claim 1, wherein the method comprises: in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1: 1.
CN202210362715.6A 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network Active CN114648722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Publications (2)

Publication Number Publication Date
CN114648722A (en) 2022-06-21
CN114648722B CN114648722B (en) 2023-07-18

Family

ID=81997696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210362715.6A Active CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Country Status (1)

Country Link
CN (1) CN114648722B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN111123257A (en) * 2019-12-30 2020-05-08 西安电子科技大学 Radar moving target multi-frame joint detection method based on graph space-time network
US20210201010A1 (en) * 2019-12-31 2021-07-01 Wuhan University Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113408343A (en) * 2021-05-12 2021-09-17 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱威 (ZHU WEI): "基于时空相关性的HEVC帧间模式决策快速算法" [Fast inter-frame mode decision algorithm for HEVC based on spatio-temporal correlation], 《通信学报》 (Journal on Communications), vol. 37, no. 4, pages 64-73 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system

Also Published As

Publication number Publication date
CN114648722B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Kim et al. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment
Liu et al. Robust video super-resolution with learned temporal dynamics
CN111768432B (en) Moving target segmentation method and system based on twin deep neural network
Remez et al. Class-aware fully convolutional Gaussian and Poisson denoising
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN109271960B (en) People counting method based on convolutional neural network
Linardos et al. Simple vs complex temporal recurrences for video saliency prediction
CN110378288B (en) Deep learning-based multi-stage space-time moving target detection method
CN109993269B (en) Single image crowd counting method based on attention mechanism
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
CN112465727A (en) Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory
CN113255616B (en) Video behavior identification method based on deep learning
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112257526B (en) Action recognition method based on feature interactive learning and terminal equipment
Prabhushankar et al. Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation
CN112446342A (en) Key frame recognition model training method, recognition method and device
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
Steffens et al. Cnn based image restoration: Adjusting ill-exposed srgb images in post-processing
CN113936309A (en) Facial block-based expression recognition method
Saleh et al. Adaptive uncertainty distribution in deep learning for unsupervised underwater image enhancement
CN113011253A (en) Face expression recognition method, device, equipment and storage medium based on ResNeXt network
Zhu et al. Ultra-high temporal resolution visual reconstruction from a fovea-like spike camera via spiking neuron model
CN111310516B (en) Behavior recognition method and device
CN110443296B (en) Hyperspectral image classification-oriented data adaptive activation function learning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant