CN114648722B - Motion recognition method based on video multipath space-time characteristic network - Google Patents

Motion recognition method based on video multipath space-time characteristic network

Info

Publication number
CN114648722B
Authority
CN
China
Prior art keywords
feature
pooling
video
space
layer
Prior art date
Legal status
Active
Application number
CN202210362715.6A
Other languages
Chinese (zh)
Other versions
CN114648722A (en)
Inventor
张海平
胡泽鹏
刘旭
马琮皓
管力明
施月玲
Current Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, School of Information Engineering of Hangzhou Dianzi University
Priority to CN202210362715.6A
Publication of CN114648722A
Application granted
Publication of CN114648722B
Legal status: Active


Classifications

    • G06F 18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N 3/08: Neural networks; learning methods


Abstract

The invention discloses an action recognition method based on a video multipath space-time characteristic network, which comprises the following steps: acquiring a video to be identified, extracting a plurality of images from the video according to the frame rate, and preprocessing the images; extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences; establishing a space-time feature network model that comprises a plurality of feature extraction modules, and inputting the image sequences to the feature extraction modules in one-to-one correspondence to obtain space-time feature matrices; aggregating the space-time feature matrices output by the feature extraction modules to output a feature vector; and classifying the feature vector with a classifier, taking the class with the highest probability as the detection result. The method greatly improves the accuracy of action video classification, helps the network model understand action videos better, and markedly improves robustness, so that complex scenes in real life can be handled.

Description

Motion recognition method based on video multipath space-time characteristic network
Technical Field
The invention belongs to the field of deep learning video understanding, and particularly relates to an action recognition method based on a video multipath space-time feature network.
Background
The rapid growth of the video market has been driven by technical innovations in the mobile internet and intelligent digital devices. Today, smart mobile devices can store thousands of videos, and mobile applications let users conveniently access hundreds of video websites through the mobile internet. Video is therefore becoming increasingly important in many areas. For example, action recognition can be applied to the review of the large numbers of videos uploaded to websites every day, to video surveillance of dangerous actions and behaviors, and even to fields such as robot motion technology. However, conventional deep learning methods often suffer from low accuracy and slow speed, especially when dealing with large numbers of videos and complex action video scenes.
In current artificial intelligence deep learning methods, action classification is typically achieved through one of two mechanisms. The first is the two-stream network, in which one stream operates on RGB frames to extract spatial information while the other takes optical flow as input to capture temporal information. Adding the optical flow branch greatly improves recognition accuracy, but computing optical flow is quite expensive. The second is to learn spatio-temporal features from multi-frame RGB images with 3D convolutions. A 3D CNN can extract spatio-temporal information effectively, but because spatial and temporal information are extracted jointly, this type of network lacks specific treatment of the time dimension and, unlike a two-stream network, cannot obtain explicit frame-to-frame motion differences from optical flow, so much important information is lost during feature extraction. It therefore remains a challenge to better separate temporal from spatial information in a 3D CNN so that each expresses its own characteristic information more explicitly, in particular when extracting spatial and temporal information from video clips. Spatial information represents the static content of a single-frame scene, such as the action entities and the specific form of the action in the video; temporal information represents the integration of spatial information over multiple frames to obtain motion context. It is therefore necessary to design an effective deep learning method for these two parts to improve the accuracy of action recognition.
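As a toy illustration of the second mechanism (not taken from the patent; the layer shapes and hyper-parameters below are arbitrary), a single 3D convolution convolves jointly over the temporal and spatial axes of a clip, which is exactly why the temporal and spatial cues become entangled:

import torch
import torch.nn as nn

# One Conv3d layer applied to a 16-frame RGB clip: time and space are convolved together.
clip = torch.randn(1, 3, 16, 224, 224)  # (batch, channels, frames, height, width)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                 # (1, 64, 16, 112, 112): joint spatio-temporal features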
Disclosure of Invention
The invention aims to solve the above problems and provides an action recognition method based on a video multipath space-time characteristic network, which can greatly improve the accuracy of action video classification, helps enhance the network model's understanding of action videos, and markedly improves robustness, so that the method can cope with complex scenes in real life.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides a motion recognition method based on a video multipath space-time characteristic network, which comprises the following steps:
S1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images;
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences;
S3, establishing a space-time feature network model, wherein the space-time feature network model comprises a plurality of feature extraction modules, each image sequence is input to the feature extraction modules in a one-to-one correspondence mode, and the feature extraction modules execute the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix;
S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors;
and S5, classifying and detecting the feature vector by using a classifier, and taking the class with the highest probability as a detection result.
Preferably, in step S1, the preprocessing is to randomly crop the image to a width and height of [256,320] pixels.
Preferably, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
Preferably, in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through a convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
Preferably, the first multi-layer perceptron has a reduction coefficient r and an amplification coefficient 2r, and the second multi-layer perceptron has a reduction coefficient r and an amplification coefficient r, where r=16.
Preferably, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1:1.
Compared with the prior art, the invention has the following beneficial effects:
According to the method, the acquired video to be identified is decomposed into image frames, and a plurality of image sequences acquired at different sampling rates serve as multi-level inputs of the space-time feature network model, so that temporal modeling is performed naturally on the acquired image sequences. A difference operation on the intermediate features extracted from the corresponding image sequences greatly reduces the interference of the video background on recognition accuracy without increasing the amount of computation. The average pooling features and maximum pooling features effectively aggregate the information of the action that is sensitive in the time dimension and model the whole video globally; in this process the robustness of the space-time feature network model is continuously strengthened, so that when the pooling information maps are aggregated, the space-time feature matrix output by each feature extraction module represents the features extracted at that level, which greatly improves the accuracy of action video classification. Moreover, fusing multiple space-time feature matrices helps enhance the network model's understanding of the action video and markedly improves robustness, so that complex scenes in real life can be handled.
Drawings
FIG. 1 is a flow chart of the motion recognition method of the present invention;
FIG. 2 is a general architecture diagram of the motion recognition method of the present invention;
FIG. 3 is a schematic structural diagram of the spatial difference module according to the present invention;
FIG. 4 is a schematic structural diagram of the temporal attention module according to the present invention.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It is noted that unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-4, a motion recognition method based on a video multipath space-time feature network includes the following steps:
s1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images. The number of images extracted from Video (Video) (total number of frames of Video) is the number of frames per second of Video (frame rate) times the total number of seconds of Video.
In one embodiment, in step S1, the preprocessing is to randomly crop the image to a width and height of [256,320] pixels.
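A minimal sketch of step S1 is given below, assuming OpenCV (cv2) is used for decoding; the function name extract_and_preprocess and the clamping of the crop size to the frame size are illustrative choices, not specified in the patent.

import random
import cv2  # assumed dependency for video decoding

def extract_and_preprocess(video_path):
    # Decode every frame (frame rate x duration frames in total) and randomly crop
    # each frame so that its width and height fall within [256, 320] pixels.
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        ch = min(random.randint(256, 320), h)   # crop height in [256, 320]
        cw = min(random.randint(256, 320), w)   # crop width in [256, 320]
        top = random.randint(0, h - ch)
        left = random.randint(0, w - cw)
        frames.append(frame[top:top + ch, left:left + cw])
    cap.release()
    return frames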
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences.
Specifically, m sampling rates [τ1, τ2, ..., τm] are set, and the images extracted at each sampling rate form one image sequence (Sample); the dimensions of the m image sequences are determined by the corresponding sampling rates and are expressed in terms of T, C, H and W, wherein T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image.
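A minimal sketch of step S2, assuming the sequences are formed by taking every τ-th preprocessed frame; the example rates (2, 8) are placeholders, since the patent does not fix the values of τ1, ..., τm.

def build_image_sequences(frames, sampling_rates=(2, 8)):  # rates are an assumption
    # Each sampling rate tau produces one image sequence containing every tau-th frame,
    # so different sequences contain different numbers of images.
    return [frames[::tau] for tau in sampling_rates]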
S3, establishing a space-time feature network model, wherein the space-time feature network model comprises a plurality of feature extraction modules, each image sequence is input to the feature extraction modules in a one-to-one correspondence mode, and the feature extraction modules execute the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix.
In one embodiment, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
In one embodiment, in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
In an embodiment, the first multi-layer perceptron has a reduction coefficient r and an amplification coefficient 2r, and the second multi-layer perceptron has a reduction coefficient r and an amplification coefficient r, where r=16.
Each feature extraction module includes, as shown in FIG. 2, a Backbone framework (for example, a ResNet framework), a spatial difference module (Spatial-Difference Modulation) and a temporal attention module (Temporal-Attention Modulation) connected in sequence. The i-th image sequence is denoted as F_i, and the elements of {F_1, F_2, ..., F_m} are fed in one-to-one correspondence as the inputs of the m feature extraction modules; the intermediate feature X of the corresponding image sequence is obtained through the Backbone framework. In this embodiment, m is set to 2 and N = 32, and the batch size (batch_size) can be adjusted according to actual requirements. The details are as follows:
as shown in fig. 3, the spatial difference module comprises a first extraction unit (Difference operation), a maximum pooling layer (MaxPooling), a first multi-layer perceptron (MLP) and a SIGMOID layer (SIGMOID), and a second extraction unit (including point multiplication and addition operations), wherein the first extraction unit is used to uniformly divide the intermediate feature X into a first feature matrix X 0 And a second feature matrix X 1 And calculate the difference X 1 -X 0 By extracting the difference characteristics through the subtraction operation, the interference of the motion recognition video background on the motion recognition accuracy can be greatly reduced under the condition of not increasing the computational complexity. The difference features sequentially pass through a maximum pooling layer, a first multi-layer perceptron and sigThe moid layer outputs spatial attention characteristics, and front-back characteristic differences are effectively extracted through the 3D maximum pooling layer to obtain F max ∈R N×(T/2r)×1×1×1 . And then F is arranged max ∈R N×(T/2r)×1×1×1 Through a first multi-layer perceptron, wherein the first multi-layer perceptron comprises a first 3D convolution layer, a ReLU layer and a second 3D convolution layer which are sequentially connected, in order to reduce parameter overhead and improve feature extraction effect, the first multi-layer perceptron uses F max ∈R N×(T/2r)×1×1×1 Firstly, shrinking and then amplifying, wherein the reduction coefficient is r, and the amplification coefficient is 2r, if r=16, F is obtained mlp ∈R N×T×1×1×1 . Will F mlp Input to the sigmoid layer obtains corresponding spatial attention features. The second extraction unit multiplies the spatial attention feature by the intermediate feature X point and then adds the spatial attention feature to the intermediate feature X to obtain a spatial feature map (Spatial Attention), wherein the spatial feature map has the following calculation formula:
Y=X+X·δ(MLP(Max(D(X))))
where X is the intermediate feature of the backbond frame output and D is the differential operation (i.e., X 1 -X 0 ) Max is the Max pooling operation, MLP is the first multi-layer perceptron operation, delta is the sigmoid operation. The above is a specific structure and operation of one spatial differential module, and other spatial differential modules are similar, only correspond to different outputs, and the convolution kernel size is different, which is not described herein.
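The following PyTorch sketch shows one possible reading of the spatial difference module (Y = X + X · δ(MLP(Max(D(X))))). Treating the temporal dimension as the channel dimension of 1×1×1 3D convolutions, pooling only over (C, H, W), and requiring T to be a multiple of 2r are assumptions made here for concreteness; the patent text only fixes the reduction coefficient r and the amplification coefficient 2r, and its quoted F_max dimension may differ from the intermediate shape used below.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialDifferenceModule(nn.Module):
    def __init__(self, num_frames, r=16):        # assumes num_frames is a multiple of 2r
        super().__init__()
        # first multi-layer perceptron: shrink the temporal channels by r, then amplify by 2r
        self.mlp = nn.Sequential(
            nn.Conv3d(num_frames // 2, num_frames // (2 * r), kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(num_frames // (2 * r), num_frames, kernel_size=1),
        )

    def forward(self, x):                          # x: intermediate feature (N, T, C, H, W)
        t = x.size(1)
        x0, x1 = x[:, : t // 2], x[:, t // 2:]     # first half / latter half along T
        diff = x1 - x0                             # difference feature, (N, T/2, C, H, W)
        f_max = F.adaptive_max_pool3d(diff, 1)     # global max pooling over (C, H, W)
        att = torch.sigmoid(self.mlp(f_max))       # spatial attention, (N, T, 1, 1, 1)
        return x + x * att                         # spatial feature map Y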
As shown in FIG. 4, the temporal attention module includes a parallel maximum pooling layer (MaxPooling) and average pooling layer (AvgPooling), a second multi-layer perceptron (Shared-MLP), a squeeze operation, a concat operation (C), a 1D convolution layer (1DCNN), an unsqueeze operation, three sigmoid layers (Sigmoid) and a third extraction unit (comprising addition, element-wise multiplication and addition). The spatial feature map is processed by a 3D maximum pooling layer to obtain the first maximum pooling feature map and by a 3D average pooling layer to obtain the first average pooling feature map. The first maximum pooling feature map and the first average pooling feature map are each processed by the second multi-layer perceptron to obtain the second maximum pooling feature map and the second average pooling feature map, respectively; the second multi-layer perceptron is similar in structure to the first, but both its reduction coefficient and amplification coefficient are r, with r = 16. The second maximum pooling feature map and the second average pooling feature map then pass through the squeeze operation, concat operation, 1D convolution layer and unsqueeze operation in sequence to obtain the fusion feature map. Specifically, the squeeze operation is applied to the second maximum pooling feature map and the second average pooling feature map to obtain F'_max and F'_avg, both of dimension R^(N×T×1); F'_max and F'_avg are concatenated along the second dimension by the concat operation to obtain F_ios ∈ R^(N×2T×1); F_ios is then passed through a 1D convolution layer with convolution kernel size (3, 3) to further strengthen the link between the average and maximum features; finally, the unsqueeze operation restores the original dimensions to obtain the fusion feature map. The second maximum pooling feature map, the second average pooling feature map and the fusion feature map are passed through the three sigmoid layers in one-to-one correspondence to obtain the first pooling information map F_temp1 ∈ R^(N×T×1×1×1), the second pooling information map F_temp2 ∈ R^(N×T×1×1×1) and the third pooling information map F_temp3 ∈ R^(N×T×1×1×1), respectively. The first pooling information map, the second pooling information map and the third pooling information map are added to form the fourth pooling information map, which is element-wise multiplied with the spatial feature map; the result is added to the spatial feature map, and the space-time feature matrix is output. The pooling information maps and the space-time feature matrix are computed as:

F_temp1 = δ(SMLP(Max(X')))
F_temp2 = δ(SMLP(Avg(X')))
F_temp3 = δ(unsqueeze(Conv(squeeze([Avg(X'), Max(X')]))))
Y' = X' + X' · (F_temp1 + F_temp2 + F_temp3)

where Y' is the space-time feature matrix, X' is the spatial feature map, δ is the sigmoid operation, SMLP is the second multi-layer perceptron operation, Max is the maximum pooling operation, Avg is the average pooling operation, Conv is the convolution operation, squeeze is the squeeze operation, and unsqueeze is the unsqueeze operation.
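The following PyTorch sketch shows one possible reading of the temporal attention module, following the step-by-step description above (the shared-MLP outputs feed the squeeze, concat and 1D-convolution branch). The 1×1×1 convolutions inside the shared MLP and the padding of the 1D convolution are assumptions, and T is assumed to be a multiple of r.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionModule(nn.Module):
    def __init__(self, num_frames, r=16):
        super().__init__()
        # second (shared) multi-layer perceptron: reduction coefficient r, amplification coefficient r
        self.shared_mlp = nn.Sequential(
            nn.Conv3d(num_frames, num_frames // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(num_frames // r, num_frames, kernel_size=1),
        )
        # 1D convolution over the concatenated (squeezed) pooling features
        self.conv1d = nn.Conv1d(2 * num_frames, num_frames, kernel_size=3, padding=1)

    def forward(self, x):                                      # x: spatial feature map (N, T, C, H, W)
        f_max = self.shared_mlp(F.adaptive_max_pool3d(x, 1))   # second max pooling feature map
        f_avg = self.shared_mlp(F.adaptive_avg_pool3d(x, 1))   # second average pooling feature map
        # squeeze -> concat along the second dimension -> 1D convolution -> unsqueeze
        fused = torch.cat([f_avg.flatten(2), f_max.flatten(2)], dim=1)  # (N, 2T, 1)
        fused = self.conv1d(fused).unsqueeze(-1).unsqueeze(-1)          # fusion feature map, (N, T, 1, 1, 1)
        att = torch.sigmoid(f_max) + torch.sigmoid(f_avg) + torch.sigmoid(fused)
        return x + x * att                                     # space-time feature matrix Y'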
The temporal attention module effectively aggregates the average pooling feature and the maximum pooling feature to extract the information of the action that is sensitive in the time dimension. It significantly alleviates the insensitivity of prior-art networks to the important features of specific action behaviors in temporal modeling; for example, in a shooting video, more attention should be paid to the positions of the ball and the hands over time, rather than to the athlete's body as prior-art networks mistakenly do. It should be noted that each temporal attention module performs similar operations, which are not described again here.
And S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors.
In one embodiment, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1:1.
The space-time feature matrices output by the feature extraction modules have the same dimensions, and an element-wise addition (i.e., aggregation, Fusion) is performed on them in alignment, preferably with a weight ratio of 1:1. Fusing multiple space-time feature matrices helps enhance the network's understanding of the action video and markedly improves robustness, so that complex scenes in real life can be handled.
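A minimal sketch of step S4 under the stated 1:1 weighting; how the fused matrix is reduced to a feature vector is not spelled out in the patent, so flattening is used here purely as a placeholder.

import torch

def aggregate(spatiotemporal_matrices):
    # Element-wise sum with a 1:1 weight ratio (all matrices share the same dimensions),
    # then flatten each sample into a feature vector.
    fused = torch.stack(spatiotemporal_matrices, dim=0).sum(dim=0)
    return fused.flatten(1)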
And S5, classifying and detecting the feature vector by using a classifier, and taking the class with the highest probability as the detection result. The classifier adopts a fully connected (linear) layer of the neural network, outputs the probability that the video to be recognized belongs to each category, and takes the category with the highest probability as the action recognition result.
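A minimal sketch of step S5; the number of classes and the feature dimension below are illustrative values, not taken from the patent.

import torch
import torch.nn as nn

num_classes, feature_dim = 400, 2048            # assumed values for illustration
classifier = nn.Linear(feature_dim, num_classes)

def predict(feature_vector):                    # feature_vector: (N, feature_dim)
    probs = torch.softmax(classifier(feature_vector), dim=1)  # probability of each action class
    return probs.argmax(dim=1)                  # class with the highest probability = detection result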
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this description.
The above embodiments merely represent several specific and detailed implementations of the present application and are not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (6)

1. A motion recognition method based on a video multipath space-time characteristic network, characterized in that the method comprises the following steps:
S1, acquiring a video to be identified, extracting a plurality of images from the video according to a frame rate, and preprocessing the images;
S2, respectively extracting different numbers of images from the preprocessed images according to different sampling rates to form a plurality of image sequences;
S3, establishing a space-time characteristic network model, wherein the space-time characteristic network model comprises a plurality of characteristic extraction modules, each image sequence is input to the characteristic extraction module in a one-to-one correspondence manner, and the characteristic extraction module performs the following operations:
S31, obtaining an intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total frame number of the video, C is the channel number of the image, H is the height of the image, and W is the width of the image;
S32, uniformly dividing the intermediate feature X into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the latter half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, outputting the spatial attention features by sequentially passing the difference features through the maximum pooling layer, the first multi-layer perceptron and the sigmoid layer;
S34, element-wise multiplying the spatial attention feature with the intermediate feature X and then adding the result to the intermediate feature X to obtain a spatial feature map;
the calculation formula of the spatial feature map is as follows:
Y = X + X · δ(MLP(Max(D(X))))
wherein D(X) = X1 - X0, Max is the maximum pooling operation, MLP is the first multi-layer perceptron operation, and δ is the sigmoid operation;
S35, inputting the spatial feature map into a parallel maximum pooling layer and an average pooling layer to correspondingly obtain a first maximum pooling feature map and a first average pooling feature map;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multi-layer perceptron to correspondingly obtain a second maximum pooling feature map and a second average pooling feature map;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fusion feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fusion feature map each through a sigmoid layer to correspondingly obtain a first pooling information map, a second pooling information map and a third pooling information map;
S39, adding the first pooling information map, the second pooling information map and the third pooling information map to form a fourth pooling information map, element-wise multiplying the fourth pooling information map with the spatial feature map, adding the result to the spatial feature map, and outputting a space-time feature matrix;
S4, aggregating the space-time feature matrixes output by the feature extraction modules to output feature vectors;
and S5, classifying and detecting the feature vector by using a classifier, wherein the class with the highest probability is used as a detection result.
2. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S1, the preprocessing is to randomly crop the image to a width and height of [256, 320] pixels.
3. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
4. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S37, the second maximum pooling feature map and the second average pooling feature map are concatenated along the second dimension through a concat operation, and obtaining the fusion feature map through a convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
5. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: the reduction coefficient and the amplification coefficient of the first multi-layer perceptron are r and 2r respectively, and the reduction coefficient and the amplification coefficient of the second multi-layer perceptron are both r, where r = 16.
6. The method for motion recognition based on a video multipath spatio-temporal feature network of claim 1, characterized in that: in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, a weight ratio of each spatio-temporal feature matrix is 1:1.
CN202210362715.6A 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network Active CN114648722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Publications (2)

Publication Number Publication Date
CN114648722A CN114648722A (en) 2022-06-21
CN114648722B 2023-07-18

Family

ID=81997696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210362715.6A Active CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Country Status (1)

Country Link
CN (1) CN114648722B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160297B (en) * 2019-12-31 2022-05-13 武汉大学 Pedestrian re-identification method and device based on residual attention mechanism space-time combined model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN111123257A (en) * 2019-12-30 2020-05-08 西安电子科技大学 Radar moving target multi-frame joint detection method based on graph space-time network
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113408343A (en) * 2021-05-12 2021-09-17 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Wei; "Fast inter-frame mode decision algorithm for HEVC based on spatio-temporal correlation" (《基于时空相关性的HEVC帧间模式决策快速算法》); Journal on Communications (《通信学报》); Vol. 37, No. 4; pp. 64-73 *

Also Published As

Publication number Publication date
CN114648722A (en) 2022-06-21

Similar Documents

Publication Publication Date Title
Liu et al. Robust video super-resolution with learned temporal dynamics
Kim et al. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment
Linardos et al. Simple vs complex temporal recurrences for video saliency prediction
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
Cheng et al. Memory-efficient network for large-scale video compressive sensing
CN109993269B (en) Single image crowd counting method based on attention mechanism
CN112465727A (en) Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory
CN112257526B (en) Action recognition method based on feature interactive learning and terminal equipment
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112446342A (en) Key frame recognition model training method, recognition method and device
CN114463218B (en) Video deblurring method based on event data driving
CN113255616B (en) Video behavior identification method based on deep learning
CN114648722B (en) Motion recognition method based on video multipath space-time characteristic network
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
CN112200096B (en) Method, device and storage medium for realizing real-time abnormal behavior identification based on compressed video
CN112446348A (en) Behavior identification method based on characteristic spectrum flow
CN111310516B (en) Behavior recognition method and device
US20190379845A1 (en) Code division compression for array cameras
CN112528077B (en) Video face retrieval method and system based on video embedding
US20240062347A1 (en) Multi-scale fusion defogging method based on stacked hourglass network
Sun et al. Video snapshot compressive imaging using residual ensemble network
CN108596831B (en) Super-resolution reconstruction method based on AdaBoost example regression
CN116524596A (en) Sports video action recognition method based on action granularity grouping structure
Li et al. REQA: Coarse-to-fine assessment of image quality to alleviate the range effect
CN112446245A (en) Efficient motion characterization method and device based on small displacement of motion boundary

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant