CN113408343A - Classroom action recognition method based on double-scale space-time block mutual attention - Google Patents

Classroom action recognition method based on double-scale space-time block mutual attention

Info

Publication number
CN113408343A
CN113408343A (application CN202110518525.4A)
Authority
CN
China
Prior art keywords
scale
space
time
student
feature
Prior art date
Legal status
Granted
Application number
CN202110518525.4A
Other languages
Chinese (zh)
Other versions
CN113408343B (en)
Inventor
李平
陈嘉
曹佳晨
徐向华
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110518525.4A
Publication of CN113408343A
Application granted
Publication of CN113408343B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Strategic Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classroom action recognition method based on dual-scale spatio-temporal block mutual attention. First, high-definition classroom student video data are preprocessed to obtain student action video frame sequences. An action recognition model is then constructed from a dual-scale feature embedding module, a spatio-temporal block mutual attention encoder and a classroom action classification module; it successively produces a dual-scale spatio-temporal feature representation, dual-scale classification vectors and an action class probability vector, and is iteratively optimized with the stochastic gradient descent algorithm. Finally, a preprocessed new classroom video is fed into the model to obtain the class of each student's action. The method not only models multiple groups of spatio-temporal blocks with spatio-temporal attention to capture multi-scale spatio-temporal information in offline and online classroom student videos, but also characterizes student image regions of different scales through a scale mutual-attention mechanism, thereby improving the accuracy of student action recognition in classroom videos.

Description

Classroom action recognition method based on double-scale space-time block mutual attention
Technical Field
The invention belongs to the technical field of video understanding and analysis, in particular to action recognition in video analysis, and relates to a classroom action recognition method based on dual-scale spatio-temporal block mutual attention.
Background
The traditional offline classroom is the main place where students learn and teachers teach. In recent years, online classes, especially during the epidemic, have become a popular mode among teachers and students, generally delivered by live network broadcast or pre-recorded lessons. Whether the class is held in a physical classroom or on an online platform, the quality of teaching directly influences how well students learn. A dilemma often encountered in practice is that, in order to guarantee teaching quality, teachers must spend considerable energy on classroom discipline and therefore cannot devote their full attention to teaching; this is especially obvious in primary school classrooms. Video action recognition technology is therefore introduced to recognize the actions of students in class, perceive their learning state in real time, and provide an intelligent analysis report reflecting classroom quality. The classroom action recognition task takes a student action video frame sequence as input and outputs the student action category; it has broad applications in scenarios such as classroom teaching, self-managed classes and unmanned exam proctoring. For example, in an unmanned proctoring environment, the classroom action recognition method can recognize an examinee's actions in real time, and an examinee exhibiting a suspected cheating action can be investigated, thereby maintaining examination discipline. The main challenges are: it is difficult to unify offline and online classroom action recognition methods, students at different distances appear in the same video frame, and recognizing the actions of many students requires a large amount of computation.
At present there are few practical applications of action recognition in classroom scenes, and existing methods are mainly based on wearable devices or skeleton information. However, wearable devices may make students uncomfortable and in turn reduce their learning efficiency, while skeleton-based methods can recognize fewer action types and their performance is easily degraded by occlusion from desks, chairs, books and other objects. In addition, traditional action recognition methods encode video frames into hand-crafted features (such as HOG3D and 3D SURF), but such features are very limited and slow to extract, so real-time requirements cannot be met. In recent years, action recognition methods built around convolutional neural networks (CNN) can learn, end to end, feature representations that reflect the latent semantic information of videos, greatly improving recognition accuracy. To extract more effective visual features, the residual network (ResNet) uses residual connections between different layers of the network, alleviating problems such as overfitting and vanishing or exploding gradients when training deeper neural network models; the Non-Local Network captures long-range dependencies with non-local operations, establishing connections between pixel blocks at different distances in a video frame through an attention mechanism and mining the semantic information between them. Moreover, the Transformer model, which originated in natural language processing, has recently been favored in computer vision: multi-head attention extracts the diverse, critical temporal information in a video frame sequence, so the model can learn more discriminative feature representations.
The existing classroom action recognition techniques still have many shortcomings. First, models are designed separately for offline or online classes, and there is no unified interface covering both kinds of classroom action recognition. Second, when extracting features, spatio-temporal attention is computed over the blocks of all video frames, which ignores the local nature of spatio-temporal features and lowers the recognition rate, and the computational cost becomes excessive at high video resolutions. In addition, many methods extract spatio-temporal features from blocks of a single scale only, and can hardly adapt to students appearing at different scales in the picture. To address the lack of a local spatio-temporal information exchange mechanism and the need to adapt to student images of different scales, an efficient classroom action recognition method that unifies offline and online classes and improves student action recognition accuracy is urgently needed.
Disclosure of Invention
The aim of the invention is to provide, in view of the deficiencies of the prior art, a classroom action recognition method based on dual-scale spatio-temporal block mutual attention, in which multiple groups of spatio-temporal blocks are modeled with spatio-temporal attention to capture multi-scale spatio-temporal information of offline and online classroom student videos, and scale mutual attention is used to characterize student images of different scales, so as to improve the classroom action recognition rate.
The method firstly acquires high-definition classroom student video data, and then sequentially performs the following operations:
step (1): preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2): constructing a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale spatio-temporal feature representation;
step (3): constructing a spatio-temporal block mutual attention encoder whose input is the dual-scale spatio-temporal feature representation and whose output is a pair of dual-scale classification vectors;
step (4): constructing a classroom action classification module whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5): iteratively training the action recognition model composed of the dual-scale feature embedding module, the spatio-temporal block mutual attention encoder and the classroom action classification module until the model converges;
step (6): preprocessing a new classroom student video, feeding the first frame image into a pre-trained object detection model to obtain the student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, feeding them into the trained action recognition model, and finally outputting the category of each student's action.
Further, the step (1) is specifically:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 60k frames, obtaining a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, cropping the 60k frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=60k$.
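As a concrete illustration of this preprocessing step, the following Python sketch samples a video at k frames per second, crops one student's bounding box and resizes the crops; OpenCV and NumPy are assumed, and the helper name extract_student_clip, the default k = 25 and the 224x224 output resolution are illustrative choices rather than values fixed by the method.

```python
import cv2
import numpy as np

def extract_student_clip(video_path, bbox, k=25, out_size=224):
    """Sample a video at k frames per second, crop one student's bounding box
    from each sampled frame and resize it to out_size x out_size.

    bbox: (x1, y1, x2, y2) pixel coordinates of one student position box.
    Returns an array of shape (T, out_size, out_size, 3) with T = 60 * k.
    """
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or k
    step = max(int(round(src_fps / k)), 1)        # keep every `step`-th frame
    x1, y1, x2, y2 = bbox
    frames, idx, target = [], 0, 60 * k           # one minute of footage
    while len(frames) < target:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            crop = frame[y1:y2, x1:x2]            # matrix-indexing crop
            crop = cv2.resize(crop, (out_size, out_size))
            frames.append(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, out_size, out_size, 3))
```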
Still further, the step (2) is specifically:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
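A minimal PyTorch sketch of the dual-scale feature embedding module described above; the kernel sizes, pooling stride, embedding dimension D and block scales L and S are illustrative assumptions, and the unfold-based blocking simply mirrors the L×L / S×S feature blocking followed by linear embedding.

```python
import torch
import torch.nn as nn

class DualScaleEmbedding(nn.Module):
    """3D conv + 3D average pooling, then LxL and SxS feature blocking
    followed by linear embedding, as in step (2). L = gamma * S."""
    def __init__(self, D=256, L=8, S=4):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool3d = nn.AvgPool3d(kernel_size=(2, 2, 2))
        self.embed_l = nn.Linear(64 * L * L, D)   # large-scale block embedding
        self.embed_s = nn.Linear(64 * S * S, D)   # small-scale block embedding
        self.L, self.S = L, S

    def _block(self, x, scale, embed):
        # x: (c, t, h, w) -> (t * n_blocks, D) token matrix for one scale
        c, t, h, w = x.shape
        x = x.unfold(2, scale, scale).unfold(3, scale, scale)   # (c,t,h/s,w/s,s,s)
        x = x.permute(1, 2, 3, 0, 4, 5).reshape(t, -1, c * scale * scale)
        return embed(x).reshape(-1, embed.out_features)

    def forward(self, video):
        # video: (B, 3, T, H, W) student action frame sequences
        feat = self.pool3d(self.conv3d(video))                   # pooled features
        X_l = torch.stack([self._block(f, self.L, self.embed_l) for f in feat])
        X_s = torch.stack([self._block(f, self.S, self.embed_s) for f in feat])
        return X_l, X_s                                          # dual-scale tokens
```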
Further, the step (3) is specifically:
(3-1) the spatio-temporal block mutual attention encoder is formed by connecting $R$ spatio-temporal block mutual attention modules in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the dual-scale spatio-temporal feature tensor $\{Z_{r,l},Z_{r,s},z^{cls}_{r,l},z^{cls}_{r,s}\}$, where $Z_{r,l}$ is the input large-scale spatio-temporal feature matrix, $Z_{r,s}$ is the input small-scale spatio-temporal feature matrix, and $z^{cls}_{r,l}$ and $z^{cls}_{r,s}$ are the large-scale and small-scale classification vectors;
the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ outputs the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$, where $\hat{Z}_{r,l}$ and $\hat{Z}_{r,s}$ are the output large-scale and small-scale mutual attention feature matrices, obtained by concatenating the output large-scale and small-scale classification vectors $\hat{z}^{cls}_{r,l}$ and $\hat{z}^{cls}_{r,s}$ with the output large-scale and small-scale spatio-temporal feature matrices;
when $r=1$, the input large-scale spatio-temporal feature matrix is $Z_{1,l}=X_{l}$, the input small-scale spatio-temporal feature matrix is $Z_{1,s}=X_{s}$, and the large-scale classification vector $z^{cls}_{1,l}$ and the small-scale classification vector $z^{cls}_{1,s}$ are obtained by random initialization;
when $R\ge r>1$, the input dual-scale spatio-temporal feature tensor is the dual-scale mutual attention feature tensor output by the previous spatio-temporal block mutual attention module $\mathcal{M}_{r-1}$;
the output of the spatio-temporal block mutual attention encoder is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the $R$-th spatio-temporal block mutual attention module $\mathcal{M}_{R}$.
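The chaining of the R modules in (3-2) could look like the following PyTorch sketch; the per-module internals are left as a stub (they are detailed in (3-3)-(3-5) below), and the randomly initialized class-token parameters and their shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlockMutualAttention(nn.Module):
    """One module M_r: block generation -> spatio-temporal attention ->
    scale mutual attention. Internals are sketched under (3-3)-(3-5)."""
    def forward(self, Z_l, Z_s, cls_l, cls_s):
        # stub: in a full implementation the three submodules update these tensors
        return Z_l, Z_s, cls_l, cls_s

class BlockMutualAttentionEncoder(nn.Module):
    def __init__(self, R=4, D=256):
        super().__init__()
        self.modules_r = nn.ModuleList(
            SpatioTemporalBlockMutualAttention() for _ in range(R))
        # classification tokens, randomly initialised for r = 1
        self.cls_l = nn.Parameter(torch.randn(1, D))
        self.cls_s = nn.Parameter(torch.randn(1, D))

    def forward(self, X_l, X_s):
        Z_l, Z_s = X_l, X_s
        cls_l = self.cls_l.expand(X_l.size(0), -1)
        cls_s = self.cls_s.expand(X_s.size(0), -1)
        for m in self.modules_r:                  # R modules connected in series
            Z_l, Z_s, cls_l, cls_s = m(Z_l, Z_s, cls_l, cls_s)
        return cls_l, cls_s                       # dual-scale classification vectors
```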
(3-3) the spatio-temporal block generation submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ regroups $Z_{r,l}$ and $Z_{r,s}$ from the input into a large-scale feature map $F_{r,l}$ and a small-scale feature map $F_{r,s}$ of uniform layout, whose height dimension $h'$ and width dimension $w'$ are determined by the size of the pooled features and the blocking scales;
according to a block height $h_{r}$, block width $w_{r}$ and block time length $t_{r}$, $F_{r,l}$ is partitioned into the $r$-th group of large-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$, where $j$ is the index of a large-scale spatio-temporal block and $Q_{r}$ is the total number of large-scale spatio-temporal blocks in the $r$-th group, satisfying $Q_{r}\,h_{r}w_{r}t_{r}=h'w'\,t$; the block size of the $r$-th group is $\lambda$ times that of the $(r-1)$-th group, $\lambda>0$, i.e. $h_{r}=\lambda h_{r-1}$, $w_{r}=\lambda w_{r-1}$, $t_{r}=\lambda t_{r-1}$ for $r\ge 2$;
each $P^{j}_{r,l}$ is then dimension-transformed into the spatio-temporal feature matrix of the large-scale spatio-temporal block, where the total number of spatial feature blocks of a large-scale spatio-temporal block is $n_{l}=h_{r}w_{r}$;
the large-scale classification vector is concatenated with this matrix to obtain the updated $j$-th large-scale spatio-temporal block feature tensor element of the $r$-th group; the same operation yields the updated small-scale spatio-temporal block feature tensor elements, where the total number of spatial feature blocks of a small-scale spatio-temporal block is $n_{s}=h_{r}w_{r}\gamma^{2}$;
this gives the $r$-th group of dual-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$ and $\{P^{j}_{r,s}\}_{j=1}^{Q_{r}}$.
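A small PyTorch sketch of the spatio-temporal block generation in (3-3), assuming the tokens of one scale have already been regrouped into a (t, h', w', D) feature map; the concrete block sizes in the usage line are illustrative.

```python
import torch

def generate_space_time_blocks(F, t_r, h_r, w_r):
    """Partition a feature map F of shape (t, h, w, D) into non-overlapping
    spatio-temporal blocks of size (t_r, h_r, w_r), as in (3-3).

    Returns a tensor of shape (Q_r, t_r * h_r * w_r, D), one token matrix per
    block, where Q_r = (t / t_r) * (h / h_r) * (w / w_r)."""
    t, h, w, D = F.shape
    assert t % t_r == 0 and h % h_r == 0 and w % w_r == 0
    F = F.reshape(t // t_r, t_r, h // h_r, h_r, w // w_r, w_r, D)
    F = F.permute(0, 2, 4, 1, 3, 5, 6)            # group block indices first
    return F.reshape(-1, t_r * h_r * w_r, D)

# illustrative usage: 8 frames, a 16x16 token grid, D = 256, blocks of 4x4x4
blocks = generate_space_time_blocks(torch.randn(8, 16, 16, 256), 4, 4, 4)
print(blocks.shape)                                # torch.Size([32, 64, 256])
```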
(3-4) the spatio-temporal attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the spatio-temporal block feature tensors $\{P^{j}_{r,l}\}$ and $\{P^{j}_{r,s}\}$ output by the spatio-temporal block generation submodule;
the $j$-th large-scale spatio-temporal block feature tensor element $P^{j}_{r,l}$ of the $r$-th group is linearly mapped to obtain, for each attention head, a query matrix $Q^{(a)}$, a key matrix $K^{(a)}$ and a value matrix $V^{(a)}$, where the attention head index $a=1,\dots,A$, $A$ is the total number of attention heads, and the dimension of each vector in the mapping matrices is $d=D/A$;
the corresponding multi-head spatio-temporal self-attention weight features are computed as $\mathrm{Att}^{(a)}=\mathrm{Softmax}\big(Q^{(a)}(K^{(a)})^{\top}/\sqrt{d}\big)\,V^{(a)}$, where $\mathrm{Softmax}(\cdot)$ is the normalized exponential function;
the attention weight features of all heads, a learnable parameter matrix and a residual structure are used to compute the large-scale spatio-temporal block attention feature matrix, which is decomposed to obtain the updated large-scale spatio-temporal block classification vector and the large-scale spatio-temporal block spatio-temporal feature matrix; here $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron and $\mathrm{LN}(\cdot)$ denotes layer normalization, applied with residual connections;
the same operation yields the small-scale spatio-temporal block attention feature matrices, giving the $r$-th group of spatio-temporal block attention feature tensors for both scales.
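A hedged PyTorch sketch of the per-block multi-head self-attention in (3-4); the pre-norm transformer layout, the class token prepended to each block and the MLP expansion factor are assumptions about details the patent states only in prose.

```python
import torch
import torch.nn as nn

class BlockSelfAttention(nn.Module):
    """Multi-head self-attention + MLP with residual connections and layer
    normalization, applied independently to every spatio-temporal block."""
    def __init__(self, D=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(D)
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, blocks, cls):
        # blocks: (Q_r, n, D) block token matrices, cls: (Q_r, 1, D) class tokens
        x = torch.cat([cls, blocks], dim=1)       # prepend classification vector
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual attention
        x = x + self.mlp(self.norm2(x))                     # residual MLP
        return x[:, :1], x[:, 1:]                 # updated cls, updated block tokens
```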
(3-5) the scale mutual attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the output of the spatio-temporal attention submodule, in which the $j$-th dual-scale spatio-temporal block classification vectors of the $r$-th group are $c^{j}_{r,l}$ and $c^{j}_{r,s}$ and the dual-scale spatio-temporal block spatio-temporal feature matrices are $E^{j}_{r,l}$ and $E^{j}_{r,s}$;
the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ is linearly mapped to obtain the query vector; the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ together with the small-scale spatio-temporal block spatio-temporal feature matrix $E^{j}_{r,s}$ is linearly mapped to obtain the key matrix and the value matrix; the multi-head spatio-temporal attention weight features are then computed as in (3-4);
using a learnable parameter matrix and a residual structure, the updated large-scale spatio-temporal block classification vector is obtained; collecting the updated classification vectors of all large-scale spatio-temporal blocks of the $r$-th group and applying a linear mapping gives the updated large-scale classification vector $\hat{z}^{cls}_{r,l}$;
all large-scale spatio-temporal block spatio-temporal feature matrices of the $r$-th group are concatenated into the large-scale spatio-temporal feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix $\hat{Z}_{r,l}$;
the same operation gives the small-scale classification vector $\hat{z}^{cls}_{r,s}$ and the small-scale mutual attention feature matrix $\hat{Z}_{r,s}$; the output of the $r$-th spatio-temporal block mutual attention module is the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$.
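The scale mutual attention of (3-5) could be sketched as follows in PyTorch: the class vector of a large-scale block queries that class vector concatenated with the small-scale block tokens (the symmetric small-to-large direction is analogous and omitted); the single cross-attention layer and its dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-scale attention: the class token of one scale queries the
    spatio-temporal block tokens of the other scale, as in (3-5)."""
    def __init__(self, D=256, heads=8):
        super().__init__()
        self.q = nn.Linear(D, D)
        self.kv = nn.Linear(D, 2 * D)
        self.proj = nn.Linear(D, D)
        self.heads, self.dh = heads, D // heads

    def forward(self, cls_l, tokens_s):
        # cls_l: (Q_r, 1, D) large-scale block class vectors
        # tokens_s: (Q_r, n_s, D) small-scale block feature matrices
        B, _, D = cls_l.shape
        q = self.q(cls_l).reshape(B, 1, self.heads, self.dh).transpose(1, 2)
        kv_in = torch.cat([cls_l, tokens_s], dim=1)   # class vector + other-scale tokens
        k, v = self.kv(kv_in).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.heads, self.dh).transpose(1, 2)
        v = v.reshape(B, -1, self.heads, self.dh).transpose(1, 2)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, 1, D)
        return cls_l + self.proj(out)             # residual-updated class vector
```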
Still further, the step (4) is specifically:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the dual-scale spatio-temporal block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector $y_{l}\in\mathbb{R}^{B}$ and the small-scale score vector $y_{s}\in\mathbb{R}^{B}$ over the action categories to which the student action may belong;
(4-2) the action class probability vector $y\in\mathbb{R}^{B}$ is computed from the two score vectors and output.
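A sketch of the classroom action classification module of step (4) in PyTorch; fusing the two score vectors by averaging before the Softmax is an assumption, since the method only states that the dual-scale classification vectors yield an action class probability vector.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Two MLP heads on the dual-scale classification vectors; their score
    vectors are fused and normalised into an action class probability vector."""
    def __init__(self, D=256, num_classes=10):
        super().__init__()
        self.head_l = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, num_classes))
        self.head_s = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, num_classes))

    def forward(self, cls_l, cls_s):
        y_l = self.head_l(cls_l)                  # large-scale score vector
        y_s = self.head_s(cls_s)                  # small-scale score vector
        return torch.softmax((y_l + y_s) / 2, dim=-1)   # class probability vector
```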
Still further, the step (5) is specifically:
(5-1) the action recognition model $\mathcal{M}$ is composed of the dual-scale feature embedding module of step (2), the dual-scale spatio-temporal block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model $\mathcal{M}$ is the student action video frame sequence $V$; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices $X_{l}$ and $X_{s}$, which are fed into the dual-scale spatio-temporal block mutual attention encoder to output the dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$; the dual-scale classification vectors are fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is trained iteratively until it converges: the loss function of the action recognition model is set to the cross-entropy loss $\mathcal{L}=-\sum_{b=1}^{B}\hat{y}_{b}\log y_{b}$; the model is optimized with the stochastic gradient descent algorithm and the model parameters are updated by back-propagation of gradients until the loss converges; here $y_{b}$ is the predicted probability that the student action belongs to action category $b$, and $\hat{y}_{b}$ is the ground-truth label: $\hat{y}_{b}=1$ if the action category of the classroom student video is $b$, and $\hat{y}_{b}=0$ otherwise.
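The iterative training of (5-3) could be sketched as follows in PyTorch, where model stands for any composition of the three modules above and is assumed to return unnormalised class scores; dataset, batch size, learning rate and epoch count are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=30, lr=1e-3):
    """Cross-entropy training with stochastic gradient descent, as in (5-3)."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()             # -sum_b y_hat_b * log y_b
    for epoch in range(epochs):
        total = 0.0
        for clips, labels in loader:              # clips: (B, 3, T, H, W)
            optimiser.zero_grad()
            logits = model(clips)                 # unnormalised class scores
            loss = criterion(logits, labels)      # log-softmax applied internally
            loss.backward()                       # back-propagate gradients
            optimiser.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")
```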
Still further, the step (6) is specifically:
(6-1) feeding the high-definition classroom student image dataset annotated with student position bounding boxes into the object detection model YOLOv5 pre-trained on the COCO2017 dataset, and iteratively training the model until it converges, giving the object detection model $\mathcal{D}$;
(6-2) for a new classroom student video, obtaining the video frame sequence as in (1-1), and feeding the first frame image into the object detection model $\mathcal{D}$ to obtain the position bounding box of each student; the action video frame sequence of each student $V_{\varphi}=\{f^{\varphi}_{1},\dots,f^{\varphi}_{T}\}$ is then obtained as in (1-2), where $\varphi$ is the student index, $\chi$ is the total number of students, and $f^{\varphi}_{i}\in\mathbb{R}^{H\times W\times 3}$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence of the $\varphi$-th student;
(6-3) feeding the action video frame sequence $V_{\varphi}$ of each student into the action recognition model $\mathcal{M}$ trained in step (5) to obtain the action class probability vector $y_{\varphi}$ of the $\varphi$-th student, and taking the action category $b'$ corresponding to the maximum probability value as the category of that student's action, i.e. $b'=\arg\max(y_{\varphi})$, where $\arg\max(\cdot)$ returns the index of the largest element of a vector.
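A sketch of the deployment flow of step (6), assuming PyTorch, the public Ultralytics YOLOv5 hub interface and the extract_student_clip helper from the step (1) sketch; the confidence threshold and preprocessing details are illustrative.

```python
import torch

def recognise_students(video_path, first_frame, action_model, k=25):
    """Detect student bounding boxes on the first frame, crop one clip per
    student and classify each clip, as in step (6)."""
    detector = torch.hub.load('ultralytics/yolov5', 'yolov5s')   # pre-trained, then fine-tuned
    det = detector(first_frame).xyxy[0]            # rows: x1, y1, x2, y2, conf, cls
    actions = []
    for x1, y1, x2, y2, conf, _ in det.tolist():
        if conf < 0.5:                             # illustrative confidence cut-off
            continue
        clip = extract_student_clip(video_path, (int(x1), int(y1), int(x2), int(y2)), k=k)
        clip = torch.from_numpy(clip).permute(3, 0, 1, 2).float().unsqueeze(0) / 255.0
        probs = action_model(clip)                 # action class probability vector
        actions.append(int(probs.argmax(dim=-1)))  # b' = argmax(y_phi)
    return actions
```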
The method of the invention uses a dual-scale spatio-temporal block mutual attention encoder to recognize student actions in student videos, and has the following characteristics: 1) unlike existing methods designed only for offline or only for online classes, the method first uses an object detection model to obtain each student's action frame sequence and then recognizes each student's action category, so it can be used in both offline and online classroom scenarios; 2) unlike existing methods that compute spatio-temporal attention over all video frame blocks at every feature extraction step, the method uses a spatio-temporal block generation submodule and a spatio-temporal attention submodule to extract spatio-temporal features within multiple groups of spatio-temporal blocks, realizing local spatio-temporal information exchange and greatly reducing the computational overhead; 3) the method blocks the video frames at two different sizes and, combined with the scale mutual attention submodule, better extracts the action information of individual students appearing at different scales in the video.
The method is suitable for action recognition in complex classroom scenes with many students appearing at different image scales, and has the following advantages: 1) it unifies the action recognition methods of offline and online classes, reducing the technical cost of applying action recognition to both; 2) features are extracted from multiple different spatio-temporal regions through the spatio-temporal block generation submodule and the spatio-temporal attention submodule, fully exploiting the local nature of spatio-temporal features to obtain more accurate recognition results and higher computational efficiency; 3) the scale mutual attention submodule learns from student images of different scales and fully fuses the spatio-temporal features under the two block scales to obtain better recognition performance. The invention is capable of local spatio-temporal feature learning and of capturing the spatial characteristics of student images at different scales, and can improve the student action recognition rate in practical application scenarios such as classroom teaching supervision, self-managed classes and unmanned exam proctoring.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A classroom action recognition method based on dual-scale spatio-temporal block mutual attention first samples a classroom student video to obtain its video frame sequence, uses an object detection model to obtain the bounding box of each student's position, and crops the frame images inside each bounding box to obtain the student action video frame sequences; it then constructs an action recognition model consisting of a dual-scale feature embedding module, a spatio-temporal block mutual attention encoder and a classroom action classification module, and finally uses the action recognition model to determine the category of each student's action. The method uses the object detection model to obtain student action frame sequences, so the subsequent recognition can be used in both offline and online classes; it uses the spatio-temporal block generation submodule and the spatio-temporal attention submodule to extract spatio-temporal features from multiple groups of spatio-temporal blocks, realizing local spatio-temporal information exchange; and it uses two block scales with the scale mutual attention submodule to capture action information at different scales, adapting to students appearing at different image scales. A classroom action recognition system constructed in this way can be deployed uniformly in both kinds of classes, while effectively extracting the spatio-temporal information of student action video frames and efficiently recognizing student action categories.
As shown in fig. 1, the method first obtains high definition classroom student video data, and then performs the following operations:
preprocessing high-definition classroom student video data to obtain a student action video frame sequence; the method comprises the following steps:
(1-1) processing each online or offline high-definition classroom student video into a corresponding video frame sequence at a sampling rate of 25 frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 1500 frames (i.e., once per minute), obtaining a high-definition classroom student image dataset;
(1-2) for each student position bounding box, cropping the 1500 frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=1500$.
Constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
Constructing a space-time block mutual attention encoder, inputting a dual-scale space-time feature representation, and outputting a dual-scale classification vector; the method comprises the following steps:
(3-1) the spatio-temporal block mutual attention encoder is formed by connecting $R$ spatio-temporal block mutual attention modules in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the dual-scale spatio-temporal feature tensor $\{Z_{r,l},Z_{r,s},z^{cls}_{r,l},z^{cls}_{r,s}\}$, where $Z_{r,l}$ is the input large-scale spatio-temporal feature matrix, $Z_{r,s}$ is the input small-scale spatio-temporal feature matrix, and $z^{cls}_{r,l}$ and $z^{cls}_{r,s}$ are the large-scale and small-scale classification vectors;
the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ outputs the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$, where $\hat{Z}_{r,l}$ and $\hat{Z}_{r,s}$ are the output large-scale and small-scale mutual attention feature matrices, obtained by concatenating the output large-scale and small-scale classification vectors $\hat{z}^{cls}_{r,l}$ and $\hat{z}^{cls}_{r,s}$ with the output large-scale and small-scale spatio-temporal feature matrices;
when $r=1$, the input large-scale spatio-temporal feature matrix is $Z_{1,l}=X_{l}$, the input small-scale spatio-temporal feature matrix is $Z_{1,s}=X_{s}$, and the large-scale classification vector $z^{cls}_{1,l}$ and the small-scale classification vector $z^{cls}_{1,s}$ are obtained by random initialization;
when $R\ge r>1$, the input dual-scale spatio-temporal feature tensor is the dual-scale mutual attention feature tensor output by the previous spatio-temporal block mutual attention module $\mathcal{M}_{r-1}$;
the output of the spatio-temporal block mutual attention encoder is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the $R$-th spatio-temporal block mutual attention module $\mathcal{M}_{R}$.
(3-3) the spatio-temporal block generation submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ regroups $Z_{r,l}$ and $Z_{r,s}$ from the input into a large-scale feature map $F_{r,l}$ and a small-scale feature map $F_{r,s}$ of uniform layout, whose height dimension $h'$ and width dimension $w'$ are determined by the size of the pooled features and the blocking scales;
according to a block height $h_{r}$, block width $w_{r}$ and block time length $t_{r}$, $F_{r,l}$ is partitioned into the $r$-th group of large-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$, where $j$ is the index of a large-scale spatio-temporal block and $Q_{r}$ is the total number of large-scale spatio-temporal blocks in the $r$-th group, satisfying $Q_{r}\,h_{r}w_{r}t_{r}=h'w'\,t$; the block size of the $r$-th group is $\lambda$ times that of the $(r-1)$-th group, $\lambda>0$, i.e. $h_{r}=\lambda h_{r-1}$, $w_{r}=\lambda w_{r-1}$, $t_{r}=\lambda t_{r-1}$ for $r\ge 2$;
each $P^{j}_{r,l}$ is then dimension-transformed into the spatio-temporal feature matrix of the large-scale spatio-temporal block, where the total number of spatial feature blocks of a large-scale spatio-temporal block is $n_{l}=h_{r}w_{r}$;
the large-scale classification vector is concatenated with this matrix to obtain the updated $j$-th large-scale spatio-temporal block feature tensor element of the $r$-th group; the same operation yields the updated small-scale spatio-temporal block feature tensor elements, where the total number of spatial feature blocks of a small-scale spatio-temporal block is $n_{s}=h_{r}w_{r}\gamma^{2}$;
this gives the $r$-th group of dual-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$ and $\{P^{j}_{r,s}\}_{j=1}^{Q_{r}}$.
(3-4) the spatio-temporal attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the spatio-temporal block feature tensors $\{P^{j}_{r,l}\}$ and $\{P^{j}_{r,s}\}$ output by the spatio-temporal block generation submodule;
the $j$-th large-scale spatio-temporal block feature tensor element $P^{j}_{r,l}$ of the $r$-th group is linearly mapped to obtain, for each attention head, a query matrix $Q^{(a)}$, a key matrix $K^{(a)}$ and a value matrix $V^{(a)}$, where the attention head index $a=1,\dots,A$, $A$ is the total number of attention heads, and the dimension of each vector in the mapping matrices is $d=D/A$;
the corresponding multi-head spatio-temporal self-attention weight features are computed as $\mathrm{Att}^{(a)}=\mathrm{Softmax}\big(Q^{(a)}(K^{(a)})^{\top}/\sqrt{d}\big)\,V^{(a)}$, where $\mathrm{Softmax}(\cdot)$ is the normalized exponential function;
the attention weight features of all heads, a learnable parameter matrix and a residual structure are used to compute the large-scale spatio-temporal block attention feature matrix, which is decomposed to obtain the updated large-scale spatio-temporal block classification vector and the large-scale spatio-temporal block spatio-temporal feature matrix; here $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron and $\mathrm{LN}(\cdot)$ denotes layer normalization, applied with residual connections;
the same operation yields the small-scale spatio-temporal block attention feature matrices, giving the $r$-th group of spatio-temporal block attention feature tensors for both scales.
(3-5) the scale mutual attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the output of the spatio-temporal attention submodule, in which the $j$-th dual-scale spatio-temporal block classification vectors of the $r$-th group are $c^{j}_{r,l}$ and $c^{j}_{r,s}$ and the dual-scale spatio-temporal block spatio-temporal feature matrices are $E^{j}_{r,l}$ and $E^{j}_{r,s}$;
the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ is linearly mapped to obtain the query vector; the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ together with the small-scale spatio-temporal block spatio-temporal feature matrix $E^{j}_{r,s}$ is linearly mapped to obtain the key matrix and the value matrix; the multi-head spatio-temporal attention weight features are then computed as in (3-4);
using a learnable parameter matrix and a residual structure, the updated large-scale spatio-temporal block classification vector is obtained; collecting the updated classification vectors of all large-scale spatio-temporal blocks of the $r$-th group and applying a linear mapping gives the updated large-scale classification vector $\hat{z}^{cls}_{r,l}$;
all large-scale spatio-temporal block spatio-temporal feature matrices of the $r$-th group are concatenated into the large-scale spatio-temporal feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix $\hat{Z}_{r,l}$;
the same operation gives the small-scale classification vector $\hat{z}^{cls}_{r,s}$ and the small-scale mutual attention feature matrix $\hat{Z}_{r,s}$; the output of the $r$-th spatio-temporal block mutual attention module is the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$.
Step (4) a classroom action classification module is constructed, input is a double-scale classification vector, and output is an action class probability vector; the method comprises the following steps:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the dual-scale spatio-temporal block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector $y_{l}\in\mathbb{R}^{B}$ and the small-scale score vector $y_{s}\in\mathbb{R}^{B}$ over the action categories to which the student action may belong;
(4-2) the action class probability vector $y\in\mathbb{R}^{B}$ is computed from the two score vectors and output.
Step (5) performing iterative training on an action recognition model consisting of a double-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module until the model is converged; the method comprises the following steps:
(5-1) the action recognition model $\mathcal{M}$ is composed of the dual-scale feature embedding module of step (2), the dual-scale spatio-temporal block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model $\mathcal{M}$ is the student action video frame sequence $V$; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices $X_{l}$ and $X_{s}$, which are fed into the dual-scale spatio-temporal block mutual attention encoder to output the dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$; the dual-scale classification vectors are fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is trained iteratively until it converges: the loss function of the action recognition model is set to the cross-entropy loss $\mathcal{L}=-\sum_{b=1}^{B}\hat{y}_{b}\log y_{b}$; the model is optimized with the stochastic gradient descent algorithm and the model parameters are updated by back-propagation of gradients until the loss converges; here $y_{b}$ is the predicted probability that the student action belongs to action category $b$, and $\hat{y}_{b}$ is the ground-truth label: $\hat{y}_{b}=1$ if the action category of the classroom student video is $b$, and $\hat{y}_{b}=0$ otherwise.
Step (6) preprocessing a new classroom student video, inputting a first frame of image into a pre-trained target detection model to obtain a student boundary frame, acquiring a corresponding video frame sequence according to the student boundary frame, inputting the video frame sequence into a trained action recognition model, and finally outputting the class of student action; the method comprises the following steps:
(6-1) feeding the high-definition classroom student image dataset annotated with student position bounding boxes into the open source object detection model YOLOv5 pre-trained on the existing COCO2017 dataset, and iteratively training the model until it converges, giving the object detection model $\mathcal{D}$;
(6-2) for a new classroom student video, obtaining the video frame sequence as in (1-1), and feeding the first frame image into the object detection model $\mathcal{D}$ to obtain the position bounding box of each student; the action video frame sequence of each student $V_{\varphi}=\{f^{\varphi}_{1},\dots,f^{\varphi}_{T}\}$ is then obtained as in (1-2), where $\varphi$ is the student index, $\chi$ is the total number of students, and $f^{\varphi}_{i}\in\mathbb{R}^{H\times W\times 3}$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence of the $\varphi$-th student;
(6-3) feeding the action video frame sequence $V_{\varphi}$ of each student into the action recognition model $\mathcal{M}$ trained in step (5) to obtain the action class probability vector $y_{\varphi}$ of the $\varphi$-th student, and taking the action category $b'$ corresponding to the maximum probability value as the category of that student's action, i.e. $b'=\arg\max(y_{\varphi})$, where $\arg\max(\cdot)$ returns the index of the largest element of a vector.
The embodiment described above is only an example of an implementation of the inventive concept; the protection scope of the present invention should not be regarded as limited to the specific form set forth in the embodiment, and also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (7)

1. The classroom action identification method based on double-scale space-time block mutual attention is characterized in that the method firstly obtains high-definition classroom student video data and then carries out the following operations:
step (1): preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2): constructing a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale spatio-temporal feature representation;
step (3): constructing a spatio-temporal block mutual attention encoder whose input is the dual-scale spatio-temporal feature representation and whose output is a pair of dual-scale classification vectors;
step (4): constructing a classroom action classification module whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5): iteratively training the action recognition model composed of the dual-scale feature embedding module, the spatio-temporal block mutual attention encoder and the classroom action classification module until the model converges;
step (6): preprocessing a new classroom student video, feeding the first frame image into a pre-trained object detection model to obtain the student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, feeding them into the trained action recognition model, and finally outputting the category of each student's action.
2. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 1, wherein the step (1) is specifically as follows:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 60k frames, obtaining a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, cropping the 60k frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=60k$.
3. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 2, wherein the step (2) is specifically as follows:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
4. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 3, wherein the step (3) is specifically as follows:
(3-1) the spatio-temporal block mutual attention encoder is composed of $R$ spatio-temporal block mutual attention modules connected in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the r-th space-time block mutual attention module
Figure FDA0003062928360000029
Input dual scale spatiotemporal feature tensor
Figure FDA00030629283600000210
Wherein the input large-scale space-time feature matrix
Figure FDA00030629283600000211
Input small scale space-time feature matrix
Figure FDA00030629283600000212
Figure FDA00030629283600000213
And
Figure FDA00030629283600000214
classifying the large-scale classification vector and the small-scale classification vector;
the r space-time block mutual attention module
Figure FDA00030629283600000215
Output dual-scale mutual attention feature tensor
Figure FDA00030629283600000216
Wherein, the output large-scale mutual attention feature matrix
Figure FDA00030629283600000217
Output small-scale mutual attention feature matrix
Figure FDA00030629283600000218
Figure FDA00030629283600000219
And
Figure FDA00030629283600000220
for the output large scale classification vector and the small scale classification vector,
Figure FDA00030629283600000221
and
Figure FDA00030629283600000222
the large-scale space-time characteristic matrix and the small-scale space-time characteristic matrix are output;
when r = 1, the input large-scale spatio-temporal feature matrix is Z^{1,l} = X^l, the input small-scale spatio-temporal feature matrix is Z^{1,s} = X^s, and the large-scale classification vector z^{1,l}_cls and the small-scale classification vector z^{1,s}_cls are obtained by random initialization;
when R ≥ r > 1, the input dual-scale spatio-temporal feature tensor Z^r is the dual-scale mutual attention feature tensor Y^{r-1} output by the previous space-time block mutual attention module E_{r-1}, i.e. Z^r = Y^{r-1};
the output of the space-time block mutual attention encoder is the pair of dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls output by the R-th space-time block mutual attention module E_R.
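As an illustrative reading of steps (3-1)–(3-2), the sketch below chains R modules and randomly initialises the first classification vectors; module_factory is a hypothetical callable producing one space-time block mutual attention module, and the pass-through stand-in is used only to keep the example runnable.

```python
import torch
import torch.nn as nn

class DualScaleEncoder(nn.Module):
    """R space-time block mutual attention modules in series; the classification vectors
    entering module r are the outputs of module r-1, and the first ones are random."""
    def __init__(self, module_factory, R=4, dim=768):
        super().__init__()
        self.modules_r = nn.ModuleList([module_factory() for _ in range(R)])
        self.cls_l = nn.Parameter(torch.randn(1, dim))   # z_cls^{1,l}, random initialisation
        self.cls_s = nn.Parameter(torch.randn(1, dim))   # z_cls^{1,s}, random initialisation

    def forward(self, x_l, x_s):
        z = (self.cls_l, x_l, self.cls_s, x_s)           # Z^1 = {cls_l, X^l, cls_s, X^s}
        for module_r in self.modules_r:                  # Z^{r+1} = Y^r
            z = module_r(*z)
        cls_l, _, cls_s, _ = z
        return cls_l, cls_s                              # dual-scale classification vectors

class _PassThrough(nn.Module):
    """Trivial stand-in for one mutual attention module, used only to run the example."""
    def forward(self, cls_l, x_l, cls_s, x_s):
        return cls_l, x_l, cls_s, x_s

enc = DualScaleEncoder(_PassThrough, R=4)
cls_l, cls_s = enc(torch.randn(392, 768), torch.randn(1568, 768))
print(cls_l.shape, cls_s.shape)   # torch.Size([1, 768]) torch.Size([1, 768])
```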
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module E_r regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map M^{r,l} and a small-scale feature map M^{r,s} of uniform size, whose height dimension h and width dimension w are determined by the frame size and the block scales;
according to the height dimension h_r, width dimension w_r and time dimension t_r, M^{r,l} is partitioned into space-time blocks to obtain the r-th group of large-scale space-time block feature tensor elements, where j is the index subscript of the large-scale space-time block, Q_r is the total number of large-scale space-time blocks in the r-th group, and Q_r·t_r·h_r·w_r covers the whole T×h×w feature volume;
the size of the space-time blocks in the r-th group is λ times that of the (r-1)-th group, λ > 0, i.e. h_r = λ·h_{r-1}, w_r = λ·w_{r-1}, t_r = λ·t_{r-1};
the j-th large-scale space-time block is then dimension-transformed to obtain its spatio-temporal feature matrix, where the total number of spatial feature blocks of the large-scale space-time block is n_l = h_r·w_r;
the large-scale classification vector and the spatio-temporal feature matrix of the j-th large-scale space-time block are spliced to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group; the same operation is performed to obtain the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of the small-scale space-time block is n_s = h_r·w_r·γ²;
this yields the r-th group of dual-scale space-time block feature tensors for the large scale and the small scale.
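A hedged sketch of the space-time blocking in step (3-3): the helper below regroups a token matrix back into a T×H×W grid and cuts it into t_r×h_r×w_r space-time blocks; the concrete sizes in the example are assumptions, not values fixed by the claim.

```python
import torch

def make_spacetime_blocks(tokens, T, H, W, t_r, h_r, w_r):
    """Regroup a (T*H*W, D) token matrix into space-time blocks of t_r x h_r x w_r tokens.

    Returns (Q, t_r*h_r*w_r, D), where Q = (T//t_r)*(H//h_r)*(W//w_r) plays the role of Q_r."""
    D = tokens.shape[-1]
    z = tokens.reshape(T, H, W, D)
    z = z.reshape(T // t_r, t_r, H // h_r, h_r, W // w_r, w_r, D)
    # move the three block indices to the front and flatten the intra-block positions
    z = z.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, t_r * h_r * w_r, D)
    return z

# Example: 8 frames, a 14x14 token grid, blocks of 2x7x7 tokens -> 16 blocks of 98 tokens
tokens = torch.randn(8 * 14 * 14, 768)
blocks = make_spacetime_blocks(tokens, T=8, H=14, W=14, t_r=2, h_r=7, w_r=7)
print(blocks.shape)   # torch.Size([16, 98, 768])
```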
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module E_r is the output of the space-time block generation submodule, i.e. the r-th group of dual-scale space-time block feature tensors;
the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped at each attention head to obtain the query matrix Q_a, the key matrix K_a and the value matrix V_a, where the attention head index a = 1, …, A, A is the total number of attention heads, and the dimension of each vector in the mapping matrices is D/A;
the corresponding multi-head spatio-temporal self-attention weight features are computed as Attn_a = Softmax(Q_a·K_a^T / sqrt(D/A))·V_a, where Softmax(·) is the normalized exponential function;
using the multi-head self-attention weight features, the learnable parameter matrix and the residual structure, the large-scale space-time block spatio-temporal attention feature matrix is computed, where MLP(·) denotes the multilayer perceptron and LN(·) denotes layer normalization;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block spatio-temporal feature matrix;
the same operation is performed to obtain the small-scale space-time block spatio-temporal attention feature matrix, thereby obtaining the r-th group of dual-scale space-time block spatio-temporal attention feature tensors.
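The space-time attention submodule of step (3-4) behaves like a pre-norm Transformer block applied within each space-time block; the sketch below uses torch.nn.MultiheadAttention and assumes the classification vector is prepended to each block's tokens, which is one plausible realisation rather than the claimed implementation itself.

```python
import torch
import torch.nn as nn

class SpaceTimeSelfAttention(nn.Module):
    """Pre-norm multi-head self-attention + MLP over the tokens of each space-time block."""
    def __init__(self, dim=768, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        # z: (Q, 1 + n, D) -- a classification vector prepended to each block's n tokens
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # multi-head attention + residual
        z = z + self.mlp(self.norm2(z))                      # LN + MLP + residual
        return z

blocks = torch.randn(16, 99, 768)               # 16 blocks, 1 classification token + 98 tokens
print(SpaceTimeSelfAttention()(blocks).shape)   # torch.Size([16, 99, 768])
```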
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module E_r is the output of the space-time attention submodule, namely, for each block index j of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block spatio-temporal feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector spliced with the small-scale space-time block spatio-temporal feature matrix is linearly mapped to obtain the key matrix and the value matrix;
the multi-head spatio-temporal attention weight features are computed, and together with the learnable parameter matrix and the residual structure they yield the updated large-scale space-time block classification vector, thereby obtaining all Q_r updated large-scale space-time block classification vectors of the r-th group; linear mapping is performed on these to obtain the updated large-scale classification vector y^{r,l}_cls;
all large-scale space-time block spatio-temporal feature matrices of the r-th group are spliced to obtain the large-scale spatio-temporal feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix Y^{r,l};
the same operation is performed to obtain the small-scale classification vector y^{r,s}_cls and the small-scale mutual attention feature matrix Y^{r,s};
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor Y^r = {y^{r,l}_cls, Y^{r,l}, y^{r,s}_cls, Y^{r,s}}.
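The scale mutual attention of step (3-5) can be read as a cross-attention in which the large-scale classification vector queries the small-scale tokens; the sketch below is one such reading, with layer choices (pre-norm, output projection) that are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-attention: the large-scale classification vector queries the small-scale tokens."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_l, tokens_s):
        # cls_l: (Q, 1, D) large-scale classification vectors, one per space-time block
        # tokens_s: (Q, n_s, D) small-scale space-time block feature tokens
        q = self.norm_q(cls_l)
        kv = self.norm_kv(tokens_s)
        out = self.attn(q, kv, kv, need_weights=False)[0]   # query = cls, keys/values = tokens
        return cls_l + self.proj(out)                       # residual update of the cls vector

cls_l = torch.randn(16, 1, 768)
tokens_s = torch.randn(16, 392, 768)
print(ScaleMutualAttention()(cls_l, tokens_s).shape)   # torch.Size([16, 1, 768])
```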
5. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 4, wherein the step (4) is specifically as follows:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls output by the dual-scale space-time block mutual attention encoder; a multilayer perceptron is used to compute the large-scale score vector and the small-scale score vector of the action category to which the student action belongs;
(4-2) the action class probability vector y is output from the large-scale and small-scale score vectors.
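A minimal sketch of the classification module of claim 5, assuming a single linear layer standing in for each perceptron head, an averaged fusion of the two score vectors, and ten action categories; the fusion rule and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, num_classes = 768, 10             # the number of classroom action categories is assumed
head_l = nn.Linear(dim, num_classes)   # linear stand-in for the large-scale perceptron head
head_s = nn.Linear(dim, num_classes)   # linear stand-in for the small-scale perceptron head

def classify(cls_l, cls_s):
    # cls_l, cls_s: (D,) final large-/small-scale classification vectors from the encoder
    score_l = head_l(cls_l)            # large-scale score vector
    score_s = head_s(cls_s)            # small-scale score vector
    # fuse the two scales and normalise into an action-class probability vector
    return torch.softmax((score_l + score_s) / 2, dim=-1)

y = classify(torch.randn(dim), torch.randn(dim))
print(y.shape, float(y.sum()))         # torch.Size([10]) ~1.0
```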
6. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 5, wherein the step (5) is specifically as follows:
(5-1) the action recognition model M is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model M is the student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices X^l and X^s, which are input to the dual-scale space-time block mutual attention encoder to output the dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls; the dual-scale classification vectors are input to the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is iteratively trained until it converges: the loss function of the action recognition model is set as the cross-entropy loss L = −Σ_b ŷ_b·log(y_b); the action recognition model is optimized with the stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation until the loss converges; where y_b is the probability that the student action belongs to action category b, ŷ_b ∈ {0, 1} is the true label, ŷ_b = 1 if the action category of the classroom student video is b, and ŷ_b = 0 otherwise.
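A hedged sketch of the training procedure of step (5-3): cross-entropy loss on the model's probability output, optimised by stochastic gradient descent; model, train_loader, the learning rate and the epoch count are placeholders supplied by the caller, not values fixed by the claim.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=30, lr=0.01):
    """Cross-entropy loss + stochastic gradient descent, as in step (5-3)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_epochs):
        for frames, labels in train_loader:    # frames: (B, T, C, H, W); labels: (B,)
            probs = model(frames)              # (B, num_classes) action-class probabilities y
            # cross-entropy with one-hot labels: loss = -sum_b yhat_b * log(y_b)
            loss = F.nll_loss(torch.log(probs + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()                    # back-propagate gradients
            optimizer.step()                   # stochastic gradient descent update
    return model
```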
7. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 6, wherein step (6) is specifically:
(6-1) the high-definition classroom student image dataset annotated with student position bounding boxes is input to a YOLOv5 object detection model pre-trained on the COCO2017 dataset, and the model is trained iteratively until convergence to obtain the trained object detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1), and the first frame image is input to the trained object detection model to obtain the position bounding box of each student; the action video frame sequence V_φ of each student is then obtained as in (1-2), where φ is the student index, χ is the total number of students, and the i-th element of V_φ is an RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) the action video frame sequence V_φ of each student is input to the action recognition model M trained in step (5) to obtain the action class probability vector y_φ of the φ-th student, and the action category b' corresponding to the maximum probability value is taken as the category of that student's action, b' = argmax(y_φ), where argmax(·) returns the index of the largest element in the vector.
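A rough sketch of the inference pipeline of claim 7 (steps 6-2 and 6-3): detect students in the first frame with a YOLOv5 model loaded from torch.hub, crop each student's frame sequence, and take the argmax of the action-class probabilities; the hub call, the assumption that frames are numpy HWC images, and the action_model interface (standing in for the preprocessing plus trained model of step (5)) are assumptions of the sketch.

```python
import torch

def recognise_students(frames, action_model, detector=None):
    """Detect each student in the first frame, crop their frame sequence, classify the action."""
    if detector is None:
        # pretrained YOLOv5 from torch.hub; fine-tuning on classroom data is done separately
        detector = torch.hub.load('ultralytics/yolov5', 'yolov5s')
    boxes = detector(frames[0]).xyxy[0]        # (num_students, 6): x1, y1, x2, y2, conf, cls
    predictions = []
    for x1, y1, x2, y2, *_ in boxes.tolist():
        # crop the same bounding box from every frame to form this student's frame sequence
        crops = [f[int(y1):int(y2), int(x1):int(x2)] for f in frames]
        probs = action_model(crops)            # action-class probability vector y_phi
        predictions.append(int(torch.as_tensor(probs).argmax()))   # b' = argmax(y_phi)
    return predictions
```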
CN202110518525.4A 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention Active CN113408343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110518525.4A CN113408343B (en) 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention


Publications (2)

Publication Number Publication Date
CN113408343A true CN113408343A (en) 2021-09-17
CN113408343B CN113408343B (en) 2022-05-13

Family

ID=77678584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110518525.4A Active CN113408343B (en) 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention

Country Status (1)

Country Link
CN (1) CN113408343B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109902293A (en) * 2019-01-30 2019-06-18 华南理工大学 A kind of file classification method based on part with global mutually attention mechanism
CN111027377A (en) * 2019-10-30 2020-04-17 杭州电子科技大学 Double-flow neural network time sequence action positioning method
CN111611847A (en) * 2020-04-01 2020-09-01 杭州电子科技大学 Video motion detection method based on scale attention hole convolution network
CN112183269A (en) * 2020-09-18 2021-01-05 哈尔滨工业大学(深圳) Target detection method and system suitable for intelligent video monitoring

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang, Jieran: "Research on Video Action Recognition Methods Based on High-Low Level Feature Fusion and Convolutional Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *
Tian, Shuai: "Research on Sequence Data Classification Models Based on Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887610A (en) * 2021-09-29 2022-01-04 内蒙古工业大学 Pollen image classification method based on cross attention distillation transducer
CN113887610B (en) * 2021-09-29 2024-02-02 内蒙古工业大学 Pollen image classification method based on cross-attention distillation transducer
CN114373224A (en) * 2021-12-28 2022-04-19 华南理工大学 Fuzzy 3D skeleton action identification method and device based on self-supervision learning
CN114648722A (en) * 2022-04-07 2022-06-21 杭州电子科技大学 Action identification method based on video multipath space-time characteristic network
CN114648722B (en) * 2022-04-07 2023-07-18 杭州电子科技大学 Motion recognition method based on video multipath space-time characteristic network
CN115273182A (en) * 2022-07-13 2022-11-01 苏州工业职业技术学院 Long video concentration degree prediction method and device
CN115273182B (en) * 2022-07-13 2023-07-11 苏州工业职业技术学院 Long video concentration prediction method and device
CN117292209A (en) * 2023-11-27 2023-12-26 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN117292209B (en) * 2023-11-27 2024-04-05 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN118230259A (en) * 2024-05-24 2024-06-21 辽宁人人畅享科技有限公司 Practice teaching management system based on internet of things technology
CN118230259B (en) * 2024-05-24 2024-07-16 辽宁人人畅享科技有限公司 Practice teaching management system based on internet of things technology

Also Published As

Publication number Publication date
CN113408343B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN113408343B (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN111709409B (en) Face living body detection method, device, equipment and medium
US11810366B1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN112036447B (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN113033276B (en) Behavior recognition method based on conversion module
CN111507275B (en) Video data time sequence information extraction method and device based on deep learning
CN111738355A (en) Image classification method and device with attention fused with mutual information and storage medium
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113536926A (en) Human body action recognition method based on distance vector and multi-angle self-adaptive network
Zhang et al. Skeleton-based action recognition with attention and temporal graph convolutional network
CN112861848B (en) Visual relation detection method and system based on known action conditions
CN115496991A (en) Reference expression understanding method based on multi-scale cross-modal feature fusion
CN109726690B (en) Multi-region description method for learner behavior image based on DenseCap network
CN117557857B (en) Detection network light weight method combining progressive guided distillation and structural reconstruction
CN117333799B (en) Middle and primary school classroom behavior detection method and device based on deformable anchor frame
CN117372837A (en) Cross-modal knowledge migration method based on knowledge distillation and unsupervised training modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant