CN114648722A - Action identification method based on video multipath space-time characteristic network - Google Patents

Action identification method based on video multipath space-time characteristic network

Info

Publication number
CN114648722A
Authority
CN
China
Prior art keywords
feature
map
pooling
video
feature map
Prior art date
Legal status
Granted
Application number
CN202210362715.6A
Other languages
Chinese (zh)
Other versions
CN114648722B (en)
Inventor
张海平
胡泽鹏
刘旭
马琮皓
管力明
施月玲
Current Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
School of Information Engineering of Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University, School of Information Engineering of Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210362715.6A priority Critical patent/CN114648722B/en
Publication of CN114648722A publication Critical patent/CN114648722A/en
Application granted granted Critical
Publication of CN114648722B publication Critical patent/CN114648722B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method based on a video multipath spatio-temporal feature network, which comprises the following steps: acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images; extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences; establishing a spatio-temporal feature network model comprising a plurality of feature extraction modules, and feeding the image sequences to the feature extraction modules in one-to-one correspondence to obtain spatio-temporal feature matrices; aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector; and classifying the feature vector with a classifier, taking the class with the highest probability as the detection result. The method greatly improves the accuracy of action video classification, helps the network model to better understand action videos, and significantly improves robustness, so that it can cope with complex scenes in real life.

Description

Action identification method based on video multipath space-time characteristic network
Technical Field
The invention belongs to the field of deep learning video understanding, and particularly relates to an action identification method based on a video multipath spatiotemporal feature network.
Background
The rapid growth of the video market benefits from technological innovations in the mobile internet, intelligent digital devices, and related fields. Today, smart mobile devices can store thousands of videos, and mobile applications allow users to conveniently access hundreds of video websites over the mobile internet. Video is therefore becoming increasingly important in many areas. For example, action recognition can be applied to the review of the large numbers of videos uploaded to websites every day, to video-based monitoring of dangerous actions and behaviors, and even to fields such as robot motion technology. However, conventional deep learning methods generally suffer from low accuracy and slow speed, and are particularly unsatisfactory when processing large numbers of videos or complex action scenes.
In current deep learning methods, action classification is typically implemented by one of two mechanisms. The first is the two-stream network, in which one stream operates on RGB frames to extract spatial information, while the other takes optical flow as input to capture temporal information. Adding the optical-flow branch greatly improves recognition accuracy, but computing optical flow is very expensive. The second is to learn spatio-temporal features from multiple RGB frames with 3D convolutions. A 3D CNN can extract spatio-temporal information effectively, but because temporal and spatial information are extracted jointly, this type of network lacks explicit treatment of the time dimension; unlike a two-stream network, it cannot obtain the specific differences between successive actions from optical-flow information, and much important information is lost during feature extraction. How to better separate temporal and spatial information in a 3D CNN so that each can express its own characteristics more clearly therefore remains a challenge. The challenge also lies in how spatial and temporal information are extracted from video segments: spatial information represents static information in a single-frame scene, such as the action entities in the video and the specific form of the action, while temporal information represents the integration of spatial information over multiple frames to obtain context about the action. It is therefore necessary to design an effective deep learning method for these two parts to improve the accuracy of action recognition.
Disclosure of Invention
The invention aims to provide an action recognition method based on a video multipath spatio-temporal feature network, which can greatly improve the accuracy of action video classification, helps the network model to better understand action videos, and significantly improves robustness, so that the method can cope with complex scenes in real life.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
the invention provides a motion identification method based on a video multipath space-time characteristic network, which comprises the following steps:
S1, acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images;
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences;
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix;
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector;
S5, classifying the feature vector with a classifier, and taking the class with the highest probability as the detection result.
Preferably, in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
Preferably, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
Preferably, in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
Preferably, the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, both the reduction coefficient and the amplification coefficient of the second multilayer perceptron are r, and r is 16.
Preferably, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1: 1.
compared with the prior art, the invention has the beneficial effects that:
the method is characterized in that an acquired video to be identified is framed as an image, a plurality of image sequences are acquired at different sampling rates and used as multi-level input of a space-time feature network model, the acquired image sequences are subjected to time sequence modeling naturally, intermediate features extracted from corresponding image sequences are subjected to difference operation, the interference of a video background on action identification accuracy can be greatly reduced on the premise of not increasing calculated amount, sensitive information of actions in a time dimension can be effectively extracted by aggregating average pooling features and maximum pooling features, the whole video is subjected to global modeling, the robustness of the space-time feature network model can be continuously enhanced in the process, and therefore when each pooling information graph is aggregated, a space-time feature matrix output by each layer of feature extraction module can represent the extracted characteristics of one layer, the accuracy of motion video classification can be greatly improved; and by fusing a plurality of space-time characteristic matrixes, the understanding of the network model to the action video can be enhanced, the robustness is obviously improved, and the complex scene in real life can be dealt with.
Drawings
FIG. 1 is a flow chart of a motion recognition method of the present invention;
FIG. 2 is a general architecture diagram of the motion recognition method of the present invention;
FIG. 3 is a schematic diagram of a spatial difference module according to the present invention;
FIG. 4 is a schematic structural diagram of an attention timing module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1-4, a method for identifying an action based on a video multipath spatiotemporal feature network includes the following steps:
S1, acquiring the video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images. The number of images extracted from the video (the total number of video frames) equals the number of frames per second (the frame rate) multiplied by the total duration of the video in seconds.
In one embodiment, in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
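For illustration only, a minimal sketch of step S1 is given below, assuming OpenCV decoding, frames of at least 256 pixels per side, and function and variable names (extract_and_preprocess, crop_range) that are illustrative rather than taken from the patent:

```python
import random

import cv2


def extract_and_preprocess(video_path, crop_range=(256, 320)):
    """Decode all frames of a video and randomly crop each one (step S1).

    The total number of extracted frames equals frame rate x duration,
    i.e. every decoded frame is kept. Assumes each frame is at least
    crop_range[0] pixels in both height and width.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        # Randomly choose a crop height and width within [256, 320] pixels.
        ch = random.randint(crop_range[0], min(crop_range[1], h))
        cw = random.randint(crop_range[0], min(crop_range[1], w))
        top = random.randint(0, h - ch)
        left = random.randint(0, w - cw)
        frames.append(frame[top:top + ch, left:left + cw])
    cap.release()
    return frames  # list of (crop_h, crop_w, 3) uint8 arrays
```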
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences.
Wherein m sampling rates are set as [τ1, τ2, ..., τm], and the images extracted at each sampling rate form an image sequence (Sample). The dimensions of the m image sequences are respectively R^((T/τi)×C×H×W) for i = 1, ..., m, wherein T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width.
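A possible realization of this multi-rate sampling (step S2) is sketched below; the strided-indexing convention and the example sampling rates are assumptions for illustration only.

```python
import torch


def build_image_sequences(frames: torch.Tensor, sampling_rates):
    """Form one image sequence per sampling rate (step S2).

    `frames` holds all preprocessed frames as a (T, C, H, W) tensor;
    sequence i keeps every tau_i-th frame, so its length is about
    T / tau_i. The exact indexing convention is an assumption.
    """
    sequences = []
    for tau in sampling_rates:
        idx = torch.arange(0, frames.shape[0], step=tau)
        sequences.append(frames[idx])  # shape: (T / tau, C, H, W)
    return sequences


# Example with m = 2 paths; the sampling rates themselves are illustrative.
# seq_fast, seq_slow = build_image_sequences(frames, sampling_rates=[2, 8])
```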
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix.
In one embodiment, in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
In one embodiment, in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
In one embodiment, the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, both the reduction coefficient and the amplification coefficient of the second multilayer perceptron are r, and r is 16.
As shown in fig. 2, each feature extraction module includes a Backbone framework, a spatial difference module (Spatial-Difference Modulation) and an attention timing module (Temporal-Attention Modulation) connected in sequence, the Backbone adopting an existing network framework as an example. Let the i-th image sequence be Fi, and take {F1, F2, ..., Fm} as the inputs of the m feature extraction modules in one-to-one correspondence; the intermediate feature X of the corresponding image sequence is obtained through the Backbone. In this embodiment, m is set to 2 and N to 32, and the batch size (Batch_Size) can be adjusted according to actual requirements. The details are as follows:
as shown in fig. 3, the spatial Difference module includes a first extraction unit (Difference operation), a max pooling layer (MaxPooling), a first multi-layer perceptron (MLP) and a SIGMOID layer (SIGMOID), and a second extraction unit (including dot multiplication and addition operations), and the first extraction unit is used to divide the intermediate features X into a first feature matrix X0And a second feature matrix X1And calculating a difference X1-X0And the difference features are extracted through subtraction operation, so that the interference of the motion recognition video background on the motion recognition accuracy can be greatly reduced under the condition of not increasing the calculation complexity. The difference characteristics sequentially pass through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output spatial attention characteristics, and the difference of the front and rear characteristics is effectively extracted through the 3D maximum pooling layer to obtain Fmax∈RN×(T/2r)×1×1×1. Then F is mixedmax∈RN×(T/2r)×1×1×1Through the first multilayer perceptron, wherein the first multilayer perceptron comprises a first 3D convolution layer, a ReLU layer and a second 3D convolution layer which are connected in sequence, in order to reduce parameter overhead and improve feature extraction effect, the first multilayer perceptron is used for extracting Fmax∈RN×(T/2r)×1×1×1Reducing and then amplifying, wherein the reduction coefficient is r, the amplification coefficient is 2r, and if r is 16, F is obtainedmlp∈RN×T×1×1×1. F is to bemlpInputting the data into a sigmoid layer to obtain corresponding spatial attention characteristics. The Spatial Attention feature is multiplied by the intermediate feature X point by using a second extraction unit and then is added with the intermediate feature X to obtain a Spatial feature map (Spatial attribute), and a calculation formula of the Spatial feature map is as follows:
Y=X+X·(δ(MLP(Max(D(X))))
wherein X is an intermediate characteristic of the Backbone frame output, and D is a differential operation (i.e. X)1-X0) Max is the maximum pooling operation, MLP is the first multi-tier perceptron operation, and δ is the sigmoid operation. The above is the specific structure and operation of one spatial difference module, and other spatial difference modules are the same, and only correspond to different outputs, and the sizes of convolution kernels are different, which is not described herein again。
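The spatial difference module could be sketched in PyTorch as follows; the use of 1×1×1 3D convolutions for the first multilayer perceptron and global max pooling over C, H and W are assumptions consistent with the stated dimensions, not implementation details given in the patent.

```python
import torch
import torch.nn as nn


class SpatialDifferenceModule(nn.Module):
    """Sketch of the spatial difference module (fig. 3).

    Assumes X has shape (N, T, C, H, W); T and the reduction ratio r must
    be known when the module is built. Layer choices (1x1x1 3D convolutions
    for the MLP, global max pooling over C, H, W) are illustrative guesses.
    """

    def __init__(self, num_frames: int, r: int = 16):
        super().__init__()
        half_t = num_frames // 2
        hidden = max(half_t // r, 1)
        # First MLP: reduce the temporal channels by r, then amplify by 2r,
        # mapping (N, T/2, 1, 1, 1) -> (N, T/(2r), 1, 1, 1) -> (N, T, 1, 1, 1).
        self.mlp = nn.Sequential(
            nn.Conv3d(half_t, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, num_frames, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.shape[1]
        x0, x1 = x[:, : t // 2], x[:, t // 2:]               # split along time
        diff = x1 - x0                                        # difference feature D(X)
        f_max = torch.amax(diff, dim=(2, 3, 4), keepdim=True) # (N, T/2, 1, 1, 1)
        attn = torch.sigmoid(self.mlp(f_max))                 # (N, T, 1, 1, 1)
        return x + x * attn                                   # Y = X + X * sigmoid(MLP(Max(D(X))))
```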
As shown in fig. 4, the attention timing module includes a parallel max pooling layer (MaxPooling) and average pooling layer (AvgPooling), a second multilayer perceptron (Shared-MLP), a squeeze operation, a concat operation (C), a 1D convolution layer (1DCNN), an unsqueeze operation, three sigmoid layers (SIGMOID), and a third extraction unit (comprising addition, dot multiplication and addition). The spatial feature map is processed by a 3D max pooling layer to obtain the first maximum pooling feature map, and by a 3D average pooling layer to obtain the first average pooling feature map. The first maximum pooling feature map and the first average pooling feature map are each processed by the second multilayer perceptron to obtain the second maximum pooling feature map and the second average pooling feature map respectively; the second multilayer perceptron is similar in structure to the first, but its reduction coefficient and amplification coefficient are both r, with r = 16. The second maximum pooling feature map and the second average pooling feature map then pass sequentially through the squeeze operation, concat operation, 1D convolution layer and unsqueeze operation to obtain the fused feature map. Specifically, the second maximum pooling feature map and the second average pooling feature map are each squeezed to obtain F'max and F'avg, both of dimension R^(N×T×1). F'max and F'avg are concatenated along the second dimension through the concat operation to obtain Fios ∈ R^(N×2T×1). Fios then passes through a 1D convolution layer with convolution kernel size (3, 3), which further strengthens the relationship between the average and maximum features. Finally, the unsqueeze operation restores the original dimensions to obtain the fused feature map. The second maximum pooling feature map, the second average pooling feature map and the fused feature map are processed by the three sigmoid layers in one-to-one correspondence to obtain the first pooled information map Ftemp1 ∈ R^(N×T×1×1×1), the second pooled information map Ftemp2 ∈ R^(N×T×1×1×1) and the third pooled information map Ftemp3 ∈ R^(N×T×1×1×1) respectively. The first, second and third pooled information maps are added to form the fourth pooled information map, which is point-multiplied with the spatial feature map; the result is added to the spatial feature map, and the spatio-temporal feature matrix is output. The pooled information maps and the spatio-temporal feature matrix are calculated as follows:
Ftemp1 = δ(SMLP(Max(X')))
Ftemp2 = δ(SMLP(Avg(X')))
Ftemp3 = δ(unsqueeze(Conv(squeeze([Avg(X'), Max(X')]))))
Y' = X' + X'·(Ftemp1 + Ftemp2 + Ftemp3)
wherein Y' is the spatio-temporal feature matrix, X' is the spatial feature map, δ is the sigmoid operation, SMLP is the second multilayer perceptron operation, Max is the max pooling operation, Avg is the average pooling operation, Conv is the convolution operation, squeeze is the squeeze operation, and unsqueeze is the unsqueeze operation.
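A corresponding sketch of the attention timing module is given below; the exact way the 1D convolution maps the concatenated (N, 2T, 1) tensor back to T temporal positions is not fully specified above, so a kernel size of 3 with stride 2 and padding 1 is assumed here purely for illustration.

```python
import torch
import torch.nn as nn


class TemporalAttentionModule(nn.Module):
    """Sketch of the attention timing module (fig. 4).

    Assumes the spatial feature map X' has shape (N, T, C, H, W). The shared
    MLP built from 1x1x1 3D convolutions and the 1D-convolution settings
    (kernel 3, stride 2, padding 1) are assumptions for illustration.
    """

    def __init__(self, num_frames: int, r: int = 16):
        super().__init__()
        hidden = max(num_frames // r, 1)
        # Shared MLP: reduce by r, then amplify by r (both coefficients are r).
        self.shared_mlp = nn.Sequential(
            nn.Conv3d(num_frames, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, num_frames, kernel_size=1),
        )
        # Maps a length-2T signal back to length T (assumed interpretation).
        self.conv1d = nn.Conv1d(1, 1, kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t = x.shape[:2]
        f_max = self.shared_mlp(torch.amax(x, dim=(2, 3, 4), keepdim=True))  # (N, T, 1, 1, 1)
        f_avg = self.shared_mlp(torch.mean(x, dim=(2, 3, 4), keepdim=True))  # (N, T, 1, 1, 1)
        # squeeze -> (N, T, 1), concat along dim 1 -> (N, 2T, 1)
        f_ios = torch.cat([f_avg.view(n, t, 1), f_max.view(n, t, 1)], dim=1)
        fused = self.conv1d(f_ios.view(n, 1, 2 * t)).view(n, t, 1, 1, 1)     # unsqueeze back
        attn = torch.sigmoid(f_max) + torch.sigmoid(f_avg) + torch.sigmoid(fused)
        return x + x * attn  # Y' = X' + X' * (Ftemp1 + Ftemp2 + Ftemp3)
```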
The attention timing module can effectively aggregate the average pooling features and the maximum pooling features to extract the information to which the action is sensitive in the time dimension. This significantly alleviates the problem that prior-art networks are insensitive to the important characteristics of specific actions during temporal modeling; for example, in a video of shooting a ball, more attention should be paid to how the positions of the ball and the hands change over time, rather than to the athlete's body, on which prior-art networks wrongly focus. It should be noted that the other attention timing modules perform similar operations, which are not described again here.
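Composing the Backbone with the two modules sketched above, one path of the multipath model might look as follows; the backbone is an arbitrary placeholder assumed to return features of shape (N, T, C, H, W).

```python
import torch.nn as nn


class FeatureExtractionPath(nn.Module):
    """One path of the multipath model: Backbone -> spatial difference module
    -> attention timing module. Reuses the two module sketches above; the
    backbone is any feature extractor returning (N, T, C, H, W) tensors."""

    def __init__(self, backbone: nn.Module, num_frames: int, r: int = 16):
        super().__init__()
        self.backbone = backbone
        self.spatial_diff = SpatialDifferenceModule(num_frames, r)
        self.temporal_attn = TemporalAttentionModule(num_frames, r)

    def forward(self, clip):
        x = self.backbone(clip)        # intermediate feature X
        y = self.spatial_diff(x)       # spatial feature map
        return self.temporal_attn(y)   # spatio-temporal feature matrix
```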
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector.
In one embodiment, in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of the spatio-temporal feature matrices is 1:1.
The spatio-temporal feature matrices output by the feature extraction modules have the same dimensions, and the weight ratio is preferably 1:1, so that they can be added element-wise in alignment (i.e. aggregation, Fusion). Fusing multiple spatio-temporal feature matrices helps the network better understand the action video, significantly improving robustness so that complex scenes in real life can be handled.
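A minimal sketch of this 1:1 aggregation, assuming all paths output spatio-temporal feature matrices of identical shape; flattening the fused matrix into a feature vector is an illustrative choice, since the exact vectorization is not detailed above.

```python
import torch


def fuse_paths(feature_matrices, weights=None):
    """Aggregate the spatio-temporal feature matrices of all paths (step S4).

    All matrices share the same shape; with the preferred 1:1 weighting this
    is a plain element-wise sum. The final flattening into a per-clip feature
    vector is an assumption for illustration.
    """
    weights = weights or [1.0] * len(feature_matrices)
    fused = sum(w * f for w, f in zip(weights, feature_matrices))
    return fused.flatten(start_dim=1)  # (N, feature_dim) feature vector
```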
S5, classifying the feature vector with the classifier and taking the class with the highest probability as the detection result. The classifier adopts a linear (fully connected) layer of the neural network, outputs the probability that the video to be recognized belongs to each class, and takes the class with the highest probability as the action recognition result.
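The classification head of step S5 can be sketched as a single linear layer followed by a softmax; the feature dimension and number of classes are illustrative placeholders.

```python
import torch
import torch.nn as nn


class ActionClassifier(nn.Module):
    """Linear classification head (step S5): the class with the highest
    probability is taken as the recognition result."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_classes)

    def forward(self, feature_vector: torch.Tensor):
        logits = self.fc(feature_vector)
        probs = torch.softmax(logits, dim=1)  # probability of each class
        return probs.argmax(dim=1), probs     # predicted class, class probabilities
```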
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-described embodiments only express several specific and detailed implementations of the present application and are not to be construed as limiting the scope of the claims. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. An action recognition method based on a video multipath spatio-temporal feature network, characterized in that the action recognition method based on the video multipath spatio-temporal feature network comprises the following steps:
S1, acquiring a video to be recognized, extracting a plurality of images from the video according to the frame rate, and preprocessing the images;
S2, extracting different numbers of images from the preprocessed images at different sampling rates to form a plurality of image sequences;
S3, establishing a spatio-temporal feature network model, wherein the spatio-temporal feature network model comprises a plurality of feature extraction modules, the image sequences are input to the feature extraction modules in one-to-one correspondence, and each feature extraction module performs the following operations:
S31, obtaining the intermediate feature X ∈ R^(N×T×C×H×W) of the corresponding image sequence, wherein N is the batch size, T is the total number of video frames, C is the number of image channels, H is the image height, and W is the image width;
S32, dividing the intermediate feature X equally into a first feature matrix X0 and a second feature matrix X1, and calculating the difference X1 - X0 as a difference feature, wherein X0 is the first half of the intermediate feature X, X1 is the second half of the intermediate feature X, and X0, X1 ∈ R^(N×(T/2)×C×H×W);
S33, passing the difference feature sequentially through a maximum pooling layer, a first multilayer perceptron and a sigmoid layer to output a spatial attention feature;
S34, point-multiplying the spatial attention feature with the intermediate feature X and adding the result to the intermediate feature X to obtain a spatial feature map;
S35, inputting the spatial feature map into a parallel maximum pooling layer and average pooling layer to obtain a first maximum pooling feature map and a first average pooling feature map respectively;
S36, inputting the first maximum pooling feature map and the first average pooling feature map into a second multilayer perceptron to obtain a second maximum pooling feature map and a second average pooling feature map respectively;
S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through a concat operation, and obtaining a fused feature map through a convolution layer;
S38, passing the second maximum pooling feature map, the second average pooling feature map and the fused feature map through sigmoid layers to obtain a first pooled information map, a second pooled information map and a third pooled information map respectively;
S39, adding the first pooled information map, the second pooled information map and the third pooled information map to form a fourth pooled information map, point-multiplying the fourth pooled information map with the spatial feature map, adding the result to the spatial feature map, and outputting a spatio-temporal feature matrix;
S4, aggregating the spatio-temporal feature matrices output by the feature extraction modules and outputting a feature vector;
S5, classifying the feature vector with a classifier, and taking the class with the highest probability as the detection result.
2. The method of claim 1, wherein the method comprises: in step S1, the pre-processing is to randomly crop the image to [256,320] pixels in width and height.
3. The method of claim 1, wherein the method comprises: in step S3, the spatio-temporal feature network model includes 2 feature extraction modules.
4. The method of claim 1, wherein the method comprises: in step S37, concatenating the second maximum pooling feature map and the second average pooling feature map along the second dimension through the concat operation and obtaining the fused feature map through the convolution layer further includes a squeeze operation and an unsqueeze operation, wherein the convolution layer is a 1D convolution layer, and the squeeze operation, the concat operation, the 1D convolution layer and the unsqueeze operation are performed in sequence.
5. The method of claim 1, wherein the method comprises: the reduction coefficient and the amplification coefficient of the first multilayer perceptron are r and 2r respectively, the reduction coefficient and the amplification coefficient of the second multilayer perceptron are both r, and r is 16.
6. The method of claim 1, wherein the method comprises: in step S4, when the spatio-temporal feature matrices output by the feature extraction modules are aggregated, the weight ratio of each spatio-temporal feature matrix is 1: 1.
CN202210362715.6A 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network Active CN114648722B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210362715.6A CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Publications (2)

Publication Number Publication Date
CN114648722A (en) 2022-06-21
CN114648722B CN114648722B (en) 2023-07-18

Family

ID=81997696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210362715.6A Active CN114648722B (en) 2022-04-07 2022-04-07 Motion recognition method based on video multipath space-time characteristic network

Country Status (1)

Country Link
CN (1) CN114648722B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933417A (en) * 2015-06-26 2015-09-23 苏州大学 Behavior recognition method based on sparse spatial-temporal characteristics
CN111123257A (en) * 2019-12-30 2020-05-08 西安电子科技大学 Radar moving target multi-frame joint detection method based on graph space-time network
US20210201010A1 (en) * 2019-12-31 2021-07-01 Wuhan University Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113408343A (en) * 2021-05-12 2021-09-17 杭州电子科技大学 Classroom action recognition method based on double-scale space-time block mutual attention
CN113850182A (en) * 2021-09-23 2021-12-28 浙江理工大学 Action identification method based on DAMR-3 DNet
CN114037930A (en) * 2021-10-18 2022-02-11 苏州大学 Video action recognition method based on space-time enhanced network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱威 (ZHU WEI): "基于时空相关性的HEVC帧间模式决策快速算法" [Fast inter-frame mode decision algorithm for HEVC based on spatio-temporal correlation], 《通信学报》 (Journal on Communications), vol. 37, no. 4, pages 64-73 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system

Also Published As

Publication number Publication date
CN114648722B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Kim et al. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment
Liu et al. Robust video super-resolution with learned temporal dynamics
CN111768432B (en) Moving target segmentation method and system based on twin deep neural network
Remez et al. Class-aware fully convolutional Gaussian and Poisson denoising
CN107977932B (en) Face image super-resolution reconstruction method based on discriminable attribute constraint generation countermeasure network
CN109271960B (en) People counting method based on convolutional neural network
Linardos et al. Simple vs complex temporal recurrences for video saliency prediction
CN110378288B (en) Deep learning-based multi-stage space-time moving target detection method
CN109993269B (en) Single image crowd counting method based on attention mechanism
WO2022121485A1 (en) Image multi-tag classification method and apparatus, computer device, and storage medium
CN112465727A (en) Low-illumination image enhancement method without normal illumination reference based on HSV color space and Retinex theory
CN113255616B (en) Video behavior identification method based on deep learning
TWI761813B (en) Video analysis method and related model training methods, electronic device and storage medium thereof
CN112257526B (en) Action recognition method based on feature interactive learning and terminal equipment
Prabhushankar et al. Ms-unique: Multi-model and sharpness-weighted unsupervised image quality estimation
CN112446342A (en) Key frame recognition model training method, recognition method and device
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN111079864A (en) Short video classification method and system based on optimized video key frame extraction
Steffens et al. Cnn based image restoration: Adjusting ill-exposed srgb images in post-processing
CN113936309A (en) Facial block-based expression recognition method
Saleh et al. Adaptive uncertainty distribution in deep learning for unsupervised underwater image enhancement
CN113011253A (en) Face expression recognition method, device, equipment and storage medium based on ResNeXt network
Zhu et al. Ultra-high temporal resolution visual reconstruction from a fovea-like spike camera via spiking neuron model
CN111310516B (en) Behavior recognition method and device
CN110443296B (en) Hyperspectral image classification-oriented data adaptive activation function learning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant