CN113408343A - Classroom action recognition method based on double-scale space-time block mutual attention - Google Patents

Classroom action recognition method based on double-scale space-time block mutual attention

Info

Publication number
CN113408343A
CN113408343A (application CN202110518525.4A)
Authority
CN
China
Prior art keywords
scale
space
time
student
feature
Prior art date
Legal status
Granted
Application number
CN202110518525.4A
Other languages
Chinese (zh)
Other versions
CN113408343B (en)
Inventor
李平
陈嘉
曹佳晨
徐向华
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110518525.4A
Publication of CN113408343A
Application granted
Publication of CN113408343B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Strategic Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a classroom action recognition method based on dual-scale spatio-temporal block mutual attention. First, high-definition classroom student video data are preprocessed to obtain student action video frame sequences. An action recognition model is then constructed from a dual-scale feature embedding module, a spatio-temporal block mutual attention encoder and a classroom action classification module; it successively produces a dual-scale spatio-temporal feature representation, dual-scale classification vectors and an action class probability vector, and is iteratively optimized with the stochastic gradient descent algorithm. Finally, a preprocessed new classroom video is fed into the model to obtain the class of each student's action. The method not only models multiple groups of spatio-temporal blocks with spatio-temporal attention to capture multi-scale spatio-temporal information in offline and online classroom student videos, but also characterizes student image regions of different scales through a scale mutual-attention mechanism, thereby improving the accuracy of student action recognition in classroom videos.

Description

Classroom action recognition method based on double-scale space-time block mutual attention
Technical Field
The invention belongs to the technical field of video understanding and analysis, in particular to action recognition in video analysis, and relates to a classroom action recognition method based on dual-scale spatio-temporal block mutual attention.
Background
The traditional offline classroom is the main place where students learn and teachers teach. In recent years, online classes, especially during the epidemic, have become a popular mode among teachers and students, generally delivered by live network broadcast or pre-recorded lessons. Whether the class is held in a physical classroom or on an online platform, the quality of teaching directly influences how well students learn. A dilemma often encountered in practice is that, in order to guarantee teaching quality, teachers must spend considerable energy on classroom discipline and therefore cannot devote their full attention to teaching; this is especially obvious in primary school classrooms. Video action recognition technology is therefore introduced to recognize the actions of students in class, perceive their learning state in real time, and provide an intelligent analysis report reflecting classroom quality. The classroom action recognition task takes a student action video frame sequence as input and outputs the student action category; it has broad applications in scenarios such as classroom teaching, self-managed classes and unmanned exam proctoring. For example, in an unmanned proctoring environment, the classroom action recognition method can recognize an examinee's actions in real time, and an examinee exhibiting a suspected cheating action can be investigated, thereby maintaining examination discipline. The main challenges are: it is difficult to unify offline and online classroom action recognition methods, students at different distances appear in the same video frame, and recognizing the actions of many students requires a large amount of computation.
At present there are few practical applications of action recognition in classroom scenes, and existing methods are mainly based on wearable devices or skeleton information. However, wearable devices may make students uncomfortable and in turn reduce their learning efficiency, while skeleton-based methods can recognize fewer action types and their performance is easily degraded by occlusion from desks, chairs, books and other objects. In addition, traditional action recognition methods encode video frames into hand-crafted features (such as HOG3D and 3D SURF), but such features are very limited and slow to extract, so real-time requirements cannot be met. In recent years, action recognition methods built around convolutional neural networks (CNN) can learn, end to end, feature representations that reflect the latent semantic information of videos, greatly improving recognition accuracy. To extract more effective visual features, the residual network (ResNet) uses residual connections between different layers of the network, alleviating problems such as overfitting and vanishing or exploding gradients when training deeper neural network models; the Non-Local Network captures long-range dependencies with non-local operations, establishing connections between pixel blocks at different distances in a video frame through an attention mechanism and mining the semantic information between them. Moreover, the Transformer model, which originated in natural language processing, has recently been favored in computer vision: multi-head attention extracts the diverse, critical temporal information in a video frame sequence, so the model can learn more discriminative feature representations.
The existing classroom action recognition techniques still have many shortcomings. First, models are designed separately for offline or online classes, and there is no unified interface covering both kinds of classroom action recognition. Second, when extracting features, spatio-temporal attention is computed over the blocks of all video frames, which ignores the local nature of spatio-temporal features and lowers the recognition rate, and the computational cost becomes excessive at high video resolutions. In addition, many methods extract spatio-temporal features from blocks of a single scale only, and can hardly adapt to students appearing at different scales in the picture. To address the lack of a local spatio-temporal information exchange mechanism and the need to adapt to student images of different scales, an efficient classroom action recognition method that unifies offline and online classes and improves student action recognition accuracy is urgently needed.
Disclosure of Invention
The aim of the invention is to provide, in view of the deficiencies of the prior art, a classroom action recognition method based on dual-scale spatio-temporal block mutual attention, in which multiple groups of spatio-temporal blocks are modeled with spatio-temporal attention to capture multi-scale spatio-temporal information of offline and online classroom student videos, and scale mutual attention is used to characterize student images of different scales, so as to improve the classroom action recognition rate.
The method firstly acquires high-definition classroom student video data, and then sequentially performs the following operations:
step (1): preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2): constructing a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale spatio-temporal feature representation;
step (3): constructing a spatio-temporal block mutual attention encoder whose input is the dual-scale spatio-temporal feature representation and whose output is a pair of dual-scale classification vectors;
step (4): constructing a classroom action classification module whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5): iteratively training the action recognition model composed of the dual-scale feature embedding module, the spatio-temporal block mutual attention encoder and the classroom action classification module until the model converges;
step (6): preprocessing a new classroom student video, feeding the first frame image into a pre-trained object detection model to obtain the student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, feeding them into the trained action recognition model, and finally outputting the category of each student's action.
Further, the step (1) is specifically:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 60k frames, obtaining a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, cropping the 60k frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=60k$.
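As a concrete illustration of this preprocessing step, the following Python sketch samples a video at k frames per second, crops one student's bounding box and resizes the crops; OpenCV and NumPy are assumed, and the helper name extract_student_clip, the default k = 25 and the 224x224 output resolution are illustrative choices rather than values fixed by the method.

```python
import cv2
import numpy as np

def extract_student_clip(video_path, bbox, k=25, out_size=224):
    """Sample a video at k frames per second, crop one student's bounding box
    from each sampled frame and resize it to out_size x out_size.

    bbox: (x1, y1, x2, y2) pixel coordinates of one student position box.
    Returns an array of shape (T, out_size, out_size, 3) with T = 60 * k.
    """
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or k
    step = max(int(round(src_fps / k)), 1)        # keep every `step`-th frame
    x1, y1, x2, y2 = bbox
    frames, idx, target = [], 0, 60 * k           # one minute of footage
    while len(frames) < target:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            crop = frame[y1:y2, x1:x2]            # matrix-indexing crop
            crop = cv2.resize(crop, (out_size, out_size))
            frames.append(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return np.stack(frames) if frames else np.empty((0, out_size, out_size, 3))
```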
Still further, the step (2) is specifically:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
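A minimal PyTorch sketch of the dual-scale feature embedding module described above; the kernel sizes, pooling stride, embedding dimension D and block scales L and S are illustrative assumptions, and the unfold-based blocking simply mirrors the L×L / S×S feature blocking followed by linear embedding.

```python
import torch
import torch.nn as nn

class DualScaleEmbedding(nn.Module):
    """3D conv + 3D average pooling, then LxL and SxS feature blocking
    followed by linear embedding, as in step (2). L = gamma * S."""
    def __init__(self, D=256, L=8, S=4):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool3d = nn.AvgPool3d(kernel_size=(2, 2, 2))
        self.embed_l = nn.Linear(64 * L * L, D)   # large-scale block embedding
        self.embed_s = nn.Linear(64 * S * S, D)   # small-scale block embedding
        self.L, self.S = L, S

    def _block(self, x, scale, embed):
        # x: (c, t, h, w) -> (t * n_blocks, D) token matrix for one scale
        c, t, h, w = x.shape
        x = x.unfold(2, scale, scale).unfold(3, scale, scale)   # (c,t,h/s,w/s,s,s)
        x = x.permute(1, 2, 3, 0, 4, 5).reshape(t, -1, c * scale * scale)
        return embed(x).reshape(-1, embed.out_features)

    def forward(self, video):
        # video: (B, 3, T, H, W) student action frame sequences
        feat = self.pool3d(self.conv3d(video))                   # pooled features
        X_l = torch.stack([self._block(f, self.L, self.embed_l) for f in feat])
        X_s = torch.stack([self._block(f, self.S, self.embed_s) for f in feat])
        return X_l, X_s                                          # dual-scale tokens
```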
Further, the step (3) is specifically:
(3-1) the spatio-temporal block mutual attention encoder is formed by connecting $R$ spatio-temporal block mutual attention modules in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the dual-scale spatio-temporal feature tensor $\{Z_{r,l},Z_{r,s},z^{cls}_{r,l},z^{cls}_{r,s}\}$, where $Z_{r,l}$ is the input large-scale spatio-temporal feature matrix, $Z_{r,s}$ is the input small-scale spatio-temporal feature matrix, and $z^{cls}_{r,l}$ and $z^{cls}_{r,s}$ are the large-scale and small-scale classification vectors;
the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ outputs the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$, where $\hat{Z}_{r,l}$ and $\hat{Z}_{r,s}$ are the output large-scale and small-scale mutual attention feature matrices, obtained by concatenating the output large-scale and small-scale classification vectors $\hat{z}^{cls}_{r,l}$ and $\hat{z}^{cls}_{r,s}$ with the output large-scale and small-scale spatio-temporal feature matrices;
when $r=1$, the input large-scale spatio-temporal feature matrix is $Z_{1,l}=X_{l}$, the input small-scale spatio-temporal feature matrix is $Z_{1,s}=X_{s}$, and the large-scale classification vector $z^{cls}_{1,l}$ and the small-scale classification vector $z^{cls}_{1,s}$ are obtained by random initialization;
when $R\ge r>1$, the input dual-scale spatio-temporal feature tensor is the dual-scale mutual attention feature tensor output by the previous spatio-temporal block mutual attention module $\mathcal{M}_{r-1}$;
the output of the spatio-temporal block mutual attention encoder is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the $R$-th spatio-temporal block mutual attention module $\mathcal{M}_{R}$.
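The chaining of the R modules in (3-2) could look like the following PyTorch sketch; the per-module internals are left as a stub (they are detailed in (3-3)-(3-5) below), and the randomly initialized class-token parameters and their shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlockMutualAttention(nn.Module):
    """One module M_r: block generation -> spatio-temporal attention ->
    scale mutual attention. Internals are sketched under (3-3)-(3-5)."""
    def forward(self, Z_l, Z_s, cls_l, cls_s):
        # stub: in a full implementation the three submodules update these tensors
        return Z_l, Z_s, cls_l, cls_s

class BlockMutualAttentionEncoder(nn.Module):
    def __init__(self, R=4, D=256):
        super().__init__()
        self.modules_r = nn.ModuleList(
            SpatioTemporalBlockMutualAttention() for _ in range(R))
        # classification tokens, randomly initialised for r = 1
        self.cls_l = nn.Parameter(torch.randn(1, D))
        self.cls_s = nn.Parameter(torch.randn(1, D))

    def forward(self, X_l, X_s):
        Z_l, Z_s = X_l, X_s
        cls_l = self.cls_l.expand(X_l.size(0), -1)
        cls_s = self.cls_s.expand(X_s.size(0), -1)
        for m in self.modules_r:                  # R modules connected in series
            Z_l, Z_s, cls_l, cls_s = m(Z_l, Z_s, cls_l, cls_s)
        return cls_l, cls_s                       # dual-scale classification vectors
```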
(3-3) the spatio-temporal block generation submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ regroups $Z_{r,l}$ and $Z_{r,s}$ from the input into a large-scale feature map $F_{r,l}$ and a small-scale feature map $F_{r,s}$ of uniform layout, whose height dimension $h'$ and width dimension $w'$ are determined by the size of the pooled features and the blocking scales;
according to a block height $h_{r}$, block width $w_{r}$ and block time length $t_{r}$, $F_{r,l}$ is partitioned into the $r$-th group of large-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$, where $j$ is the index of a large-scale spatio-temporal block and $Q_{r}$ is the total number of large-scale spatio-temporal blocks in the $r$-th group, satisfying $Q_{r}\,h_{r}w_{r}t_{r}=h'w'\,t$; the block size of the $r$-th group is $\lambda$ times that of the $(r-1)$-th group, $\lambda>0$, i.e. $h_{r}=\lambda h_{r-1}$, $w_{r}=\lambda w_{r-1}$, $t_{r}=\lambda t_{r-1}$ for $r\ge 2$;
each $P^{j}_{r,l}$ is then dimension-transformed into the spatio-temporal feature matrix of the large-scale spatio-temporal block, where the total number of spatial feature blocks of a large-scale spatio-temporal block is $n_{l}=h_{r}w_{r}$;
the large-scale classification vector is concatenated with this matrix to obtain the updated $j$-th large-scale spatio-temporal block feature tensor element of the $r$-th group; the same operation yields the updated small-scale spatio-temporal block feature tensor elements, where the total number of spatial feature blocks of a small-scale spatio-temporal block is $n_{s}=h_{r}w_{r}\gamma^{2}$;
this gives the $r$-th group of dual-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$ and $\{P^{j}_{r,s}\}_{j=1}^{Q_{r}}$.
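A small PyTorch sketch of the spatio-temporal block generation in (3-3), assuming the tokens of one scale have already been regrouped into a (t, h', w', D) feature map; the concrete block sizes in the usage line are illustrative.

```python
import torch

def generate_space_time_blocks(F, t_r, h_r, w_r):
    """Partition a feature map F of shape (t, h, w, D) into non-overlapping
    spatio-temporal blocks of size (t_r, h_r, w_r), as in (3-3).

    Returns a tensor of shape (Q_r, t_r * h_r * w_r, D), one token matrix per
    block, where Q_r = (t / t_r) * (h / h_r) * (w / w_r)."""
    t, h, w, D = F.shape
    assert t % t_r == 0 and h % h_r == 0 and w % w_r == 0
    F = F.reshape(t // t_r, t_r, h // h_r, h_r, w // w_r, w_r, D)
    F = F.permute(0, 2, 4, 1, 3, 5, 6)            # group block indices first
    return F.reshape(-1, t_r * h_r * w_r, D)

# illustrative usage: 8 frames, a 16x16 token grid, D = 256, blocks of 4x4x4
blocks = generate_space_time_blocks(torch.randn(8, 16, 16, 256), 4, 4, 4)
print(blocks.shape)                                # torch.Size([32, 64, 256])
```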
(3-4) the spatio-temporal attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the spatio-temporal block feature tensors $\{P^{j}_{r,l}\}$ and $\{P^{j}_{r,s}\}$ output by the spatio-temporal block generation submodule;
the $j$-th large-scale spatio-temporal block feature tensor element $P^{j}_{r,l}$ of the $r$-th group is linearly mapped to obtain, for each attention head, a query matrix $Q^{(a)}$, a key matrix $K^{(a)}$ and a value matrix $V^{(a)}$, where the attention head index $a=1,\dots,A$, $A$ is the total number of attention heads, and the dimension of each vector in the mapping matrices is $d=D/A$;
the corresponding multi-head spatio-temporal self-attention weight features are computed as $\mathrm{Att}^{(a)}=\mathrm{Softmax}\big(Q^{(a)}(K^{(a)})^{\top}/\sqrt{d}\big)\,V^{(a)}$, where $\mathrm{Softmax}(\cdot)$ is the normalized exponential function;
the attention weight features of all heads, a learnable parameter matrix and a residual structure are used to compute the large-scale spatio-temporal block attention feature matrix, which is decomposed to obtain the updated large-scale spatio-temporal block classification vector and the large-scale spatio-temporal block spatio-temporal feature matrix; here $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron and $\mathrm{LN}(\cdot)$ denotes layer normalization, applied with residual connections;
the same operation yields the small-scale spatio-temporal block attention feature matrices, giving the $r$-th group of spatio-temporal block attention feature tensors for both scales.
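A hedged PyTorch sketch of the per-block multi-head self-attention in (3-4); the pre-norm transformer layout, the class token prepended to each block and the MLP expansion factor are assumptions about details the patent states only in prose.

```python
import torch
import torch.nn as nn

class BlockSelfAttention(nn.Module):
    """Multi-head self-attention + MLP with residual connections and layer
    normalization, applied independently to every spatio-temporal block."""
    def __init__(self, D=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(D)
        self.attn = nn.MultiheadAttention(D, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, blocks, cls):
        # blocks: (Q_r, n, D) block token matrices, cls: (Q_r, 1, D) class tokens
        x = torch.cat([cls, blocks], dim=1)       # prepend classification vector
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual attention
        x = x + self.mlp(self.norm2(x))                     # residual MLP
        return x[:, :1], x[:, 1:]                 # updated cls, updated block tokens
```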
(3-5) the scale mutual attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the output of the spatio-temporal attention submodule, in which the $j$-th dual-scale spatio-temporal block classification vectors of the $r$-th group are $c^{j}_{r,l}$ and $c^{j}_{r,s}$ and the dual-scale spatio-temporal block spatio-temporal feature matrices are $E^{j}_{r,l}$ and $E^{j}_{r,s}$;
the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ is linearly mapped to obtain the query vector; the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ together with the small-scale spatio-temporal block spatio-temporal feature matrix $E^{j}_{r,s}$ is linearly mapped to obtain the key matrix and the value matrix; the multi-head spatio-temporal attention weight features are then computed as in (3-4);
using a learnable parameter matrix and a residual structure, the updated large-scale spatio-temporal block classification vector is obtained; collecting the updated classification vectors of all large-scale spatio-temporal blocks of the $r$-th group and applying a linear mapping gives the updated large-scale classification vector $\hat{z}^{cls}_{r,l}$;
all large-scale spatio-temporal block spatio-temporal feature matrices of the $r$-th group are concatenated into the large-scale spatio-temporal feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix $\hat{Z}_{r,l}$;
the same operation gives the small-scale classification vector $\hat{z}^{cls}_{r,s}$ and the small-scale mutual attention feature matrix $\hat{Z}_{r,s}$; the output of the $r$-th spatio-temporal block mutual attention module is the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$.
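The scale mutual attention of (3-5) could be sketched as follows in PyTorch: the class vector of a large-scale block queries that class vector concatenated with the small-scale block tokens (the symmetric small-to-large direction is analogous and omitted); the single cross-attention layer and its dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-scale attention: the class token of one scale queries the
    spatio-temporal block tokens of the other scale, as in (3-5)."""
    def __init__(self, D=256, heads=8):
        super().__init__()
        self.q = nn.Linear(D, D)
        self.kv = nn.Linear(D, 2 * D)
        self.proj = nn.Linear(D, D)
        self.heads, self.dh = heads, D // heads

    def forward(self, cls_l, tokens_s):
        # cls_l: (Q_r, 1, D) large-scale block class vectors
        # tokens_s: (Q_r, n_s, D) small-scale block feature matrices
        B, _, D = cls_l.shape
        q = self.q(cls_l).reshape(B, 1, self.heads, self.dh).transpose(1, 2)
        kv_in = torch.cat([cls_l, tokens_s], dim=1)   # class vector + other-scale tokens
        k, v = self.kv(kv_in).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.heads, self.dh).transpose(1, 2)
        v = v.reshape(B, -1, self.heads, self.dh).transpose(1, 2)
        att = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(B, 1, D)
        return cls_l + self.proj(out)             # residual-updated class vector
```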
Still further, the step (4) is specifically:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the dual-scale spatio-temporal block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector $y_{l}\in\mathbb{R}^{B}$ and the small-scale score vector $y_{s}\in\mathbb{R}^{B}$ over the action categories to which the student action may belong;
(4-2) the action class probability vector $y\in\mathbb{R}^{B}$ is computed from the two score vectors and output.
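A sketch of the classroom action classification module of step (4) in PyTorch; fusing the two score vectors by averaging before the Softmax is an assumption, since the method only states that the dual-scale classification vectors yield an action class probability vector.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    """Two MLP heads on the dual-scale classification vectors; their score
    vectors are fused and normalised into an action class probability vector."""
    def __init__(self, D=256, num_classes=10):
        super().__init__()
        self.head_l = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, num_classes))
        self.head_s = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, num_classes))

    def forward(self, cls_l, cls_s):
        y_l = self.head_l(cls_l)                  # large-scale score vector
        y_s = self.head_s(cls_s)                  # small-scale score vector
        return torch.softmax((y_l + y_s) / 2, dim=-1)   # class probability vector
```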
Still further, the step (5) is specifically:
(5-1) the action recognition model $\mathcal{M}$ is composed of the dual-scale feature embedding module of step (2), the dual-scale spatio-temporal block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model $\mathcal{M}$ is the student action video frame sequence $V$; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices $X_{l}$ and $X_{s}$, which are fed into the dual-scale spatio-temporal block mutual attention encoder to output the dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$; the dual-scale classification vectors are fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is trained iteratively until it converges: the loss function of the action recognition model is set to the cross-entropy loss $\mathcal{L}=-\sum_{b=1}^{B}\hat{y}_{b}\log y_{b}$; the model is optimized with the stochastic gradient descent algorithm and the model parameters are updated by back-propagation of gradients until the loss converges; here $y_{b}$ is the predicted probability that the student action belongs to action category $b$, and $\hat{y}_{b}$ is the ground-truth label: $\hat{y}_{b}=1$ if the action category of the classroom student video is $b$, and $\hat{y}_{b}=0$ otherwise.
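The iterative training of (5-3) could be sketched as follows in PyTorch, where model stands for any composition of the three modules above and is assumed to return unnormalised class scores; dataset, batch size, learning rate and epoch count are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, epochs=30, lr=1e-3):
    """Cross-entropy training with stochastic gradient descent, as in (5-3)."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()             # -sum_b y_hat_b * log y_b
    for epoch in range(epochs):
        total = 0.0
        for clips, labels in loader:              # clips: (B, 3, T, H, W)
            optimiser.zero_grad()
            logits = model(clips)                 # unnormalised class scores
            loss = criterion(logits, labels)      # log-softmax applied internally
            loss.backward()                       # back-propagate gradients
            optimiser.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")
```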
Still further, the step (6) is specifically:
(6-1) feeding the high-definition classroom student image dataset annotated with student position bounding boxes into the object detection model YOLOv5 pre-trained on the COCO2017 dataset, and iteratively training the model until it converges, giving the object detection model $\mathcal{D}$;
(6-2) for a new classroom student video, obtaining the video frame sequence as in (1-1), and feeding the first frame image into the object detection model $\mathcal{D}$ to obtain the position bounding box of each student; the action video frame sequence of each student $V_{\varphi}=\{f^{\varphi}_{1},\dots,f^{\varphi}_{T}\}$ is then obtained as in (1-2), where $\varphi$ is the student index, $\chi$ is the total number of students, and $f^{\varphi}_{i}\in\mathbb{R}^{H\times W\times 3}$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence of the $\varphi$-th student;
(6-3) feeding the action video frame sequence $V_{\varphi}$ of each student into the action recognition model $\mathcal{M}$ trained in step (5) to obtain the action class probability vector $y_{\varphi}$ of the $\varphi$-th student, and taking the action category $b'$ corresponding to the maximum probability value as the category of that student's action, i.e. $b'=\arg\max(y_{\varphi})$, where $\arg\max(\cdot)$ returns the index of the largest element of a vector.
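A sketch of the deployment flow of step (6), assuming PyTorch, the public Ultralytics YOLOv5 hub interface and the extract_student_clip helper from the step (1) sketch; the confidence threshold and preprocessing details are illustrative.

```python
import torch

def recognise_students(video_path, first_frame, action_model, k=25):
    """Detect student bounding boxes on the first frame, crop one clip per
    student and classify each clip, as in step (6)."""
    detector = torch.hub.load('ultralytics/yolov5', 'yolov5s')   # pre-trained, then fine-tuned
    det = detector(first_frame).xyxy[0]            # rows: x1, y1, x2, y2, conf, cls
    actions = []
    for x1, y1, x2, y2, conf, _ in det.tolist():
        if conf < 0.5:                             # illustrative confidence cut-off
            continue
        clip = extract_student_clip(video_path, (int(x1), int(y1), int(x2), int(y2)), k=k)
        clip = torch.from_numpy(clip).permute(3, 0, 1, 2).float().unsqueeze(0) / 255.0
        probs = action_model(clip)                 # action class probability vector
        actions.append(int(probs.argmax(dim=-1)))  # b' = argmax(y_phi)
    return actions
```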
The method of the invention uses a dual-scale spatio-temporal block mutual attention encoder to recognize student actions in student videos, and has the following characteristics: 1) unlike existing methods designed only for offline or only for online classes, the method first uses an object detection model to obtain each student's action frame sequence and then recognizes each student's action category, so it can be used in both offline and online classroom scenarios; 2) unlike existing methods that compute spatio-temporal attention over all video frame blocks at every feature extraction step, the method uses a spatio-temporal block generation submodule and a spatio-temporal attention submodule to extract spatio-temporal features within multiple groups of spatio-temporal blocks, realizing local spatio-temporal information exchange and greatly reducing the computational overhead; 3) the method blocks the video frames at two different sizes and, combined with the scale mutual attention submodule, better extracts the action information of individual students appearing at different scales in the video.
The method is suitable for action recognition in complex classroom scenes with many students appearing at different image scales, and has the following advantages: 1) it unifies the action recognition methods of offline and online classes, reducing the technical cost of applying action recognition to both; 2) features are extracted from multiple different spatio-temporal regions through the spatio-temporal block generation submodule and the spatio-temporal attention submodule, fully exploiting the local nature of spatio-temporal features to obtain more accurate recognition results and higher computational efficiency; 3) the scale mutual attention submodule learns from student images of different scales and fully fuses the spatio-temporal features under the two block scales to obtain better recognition performance. The invention is capable of local spatio-temporal feature learning and of capturing the spatial characteristics of student images at different scales, and can improve the student action recognition rate in practical application scenarios such as classroom teaching supervision, self-managed classes and unmanned exam proctoring.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A classroom action recognition method based on dual-scale spatio-temporal block mutual attention first samples a classroom student video to obtain its video frame sequence, uses an object detection model to obtain the bounding box of each student's position, and crops the frame images inside each bounding box to obtain the student action video frame sequences; it then constructs an action recognition model consisting of a dual-scale feature embedding module, a spatio-temporal block mutual attention encoder and a classroom action classification module, and finally uses the action recognition model to determine the category of each student's action. The method uses the object detection model to obtain student action frame sequences, so the subsequent recognition can be used in both offline and online classes; it uses the spatio-temporal block generation submodule and the spatio-temporal attention submodule to extract spatio-temporal features from multiple groups of spatio-temporal blocks, realizing local spatio-temporal information exchange; and it uses two block scales with the scale mutual attention submodule to capture action information at different scales, adapting to students appearing at different image scales. A classroom action recognition system constructed in this way can be deployed uniformly in both kinds of classes, while effectively extracting the spatio-temporal information of student action video frames and efficiently recognizing student action categories.
As shown in fig. 1, the method first obtains high definition classroom student video data, and then performs the following operations:
preprocessing high-definition classroom student video data to obtain a student action video frame sequence; the method comprises the following steps:
(1-1) processing each online or offline high-definition classroom student video into a corresponding video frame sequence at a sampling rate of 25 frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 1500 frames (i.e., once per minute), obtaining a high-definition classroom student image dataset;
(1-2) for each student position bounding box, cropping the 1500 frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=1500$.
Constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
Constructing a space-time block mutual attention encoder, inputting a dual-scale space-time feature representation, and outputting a dual-scale classification vector; the method comprises the following steps:
(3-1) the spatio-temporal block mutual attention encoder is formed by connecting $R$ spatio-temporal block mutual attention modules in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the dual-scale spatio-temporal feature tensor $\{Z_{r,l},Z_{r,s},z^{cls}_{r,l},z^{cls}_{r,s}\}$, where $Z_{r,l}$ is the input large-scale spatio-temporal feature matrix, $Z_{r,s}$ is the input small-scale spatio-temporal feature matrix, and $z^{cls}_{r,l}$ and $z^{cls}_{r,s}$ are the large-scale and small-scale classification vectors;
the $r$-th spatio-temporal block mutual attention module $\mathcal{M}_{r}$ outputs the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$, where $\hat{Z}_{r,l}$ and $\hat{Z}_{r,s}$ are the output large-scale and small-scale mutual attention feature matrices, obtained by concatenating the output large-scale and small-scale classification vectors $\hat{z}^{cls}_{r,l}$ and $\hat{z}^{cls}_{r,s}$ with the output large-scale and small-scale spatio-temporal feature matrices;
when $r=1$, the input large-scale spatio-temporal feature matrix is $Z_{1,l}=X_{l}$, the input small-scale spatio-temporal feature matrix is $Z_{1,s}=X_{s}$, and the large-scale classification vector $z^{cls}_{1,l}$ and the small-scale classification vector $z^{cls}_{1,s}$ are obtained by random initialization;
when $R\ge r>1$, the input dual-scale spatio-temporal feature tensor is the dual-scale mutual attention feature tensor output by the previous spatio-temporal block mutual attention module $\mathcal{M}_{r-1}$;
the output of the spatio-temporal block mutual attention encoder is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the $R$-th spatio-temporal block mutual attention module $\mathcal{M}_{R}$.
(3-3) the spatio-temporal block generation submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ regroups $Z_{r,l}$ and $Z_{r,s}$ from the input into a large-scale feature map $F_{r,l}$ and a small-scale feature map $F_{r,s}$ of uniform layout, whose height dimension $h'$ and width dimension $w'$ are determined by the size of the pooled features and the blocking scales;
according to a block height $h_{r}$, block width $w_{r}$ and block time length $t_{r}$, $F_{r,l}$ is partitioned into the $r$-th group of large-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$, where $j$ is the index of a large-scale spatio-temporal block and $Q_{r}$ is the total number of large-scale spatio-temporal blocks in the $r$-th group, satisfying $Q_{r}\,h_{r}w_{r}t_{r}=h'w'\,t$; the block size of the $r$-th group is $\lambda$ times that of the $(r-1)$-th group, $\lambda>0$, i.e. $h_{r}=\lambda h_{r-1}$, $w_{r}=\lambda w_{r-1}$, $t_{r}=\lambda t_{r-1}$ for $r\ge 2$;
each $P^{j}_{r,l}$ is then dimension-transformed into the spatio-temporal feature matrix of the large-scale spatio-temporal block, where the total number of spatial feature blocks of a large-scale spatio-temporal block is $n_{l}=h_{r}w_{r}$;
the large-scale classification vector is concatenated with this matrix to obtain the updated $j$-th large-scale spatio-temporal block feature tensor element of the $r$-th group; the same operation yields the updated small-scale spatio-temporal block feature tensor elements, where the total number of spatial feature blocks of a small-scale spatio-temporal block is $n_{s}=h_{r}w_{r}\gamma^{2}$;
this gives the $r$-th group of dual-scale spatio-temporal block feature tensors $\{P^{j}_{r,l}\}_{j=1}^{Q_{r}}$ and $\{P^{j}_{r,s}\}_{j=1}^{Q_{r}}$.
(3-4) the spatio-temporal attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the spatio-temporal block feature tensors $\{P^{j}_{r,l}\}$ and $\{P^{j}_{r,s}\}$ output by the spatio-temporal block generation submodule;
the $j$-th large-scale spatio-temporal block feature tensor element $P^{j}_{r,l}$ of the $r$-th group is linearly mapped to obtain, for each attention head, a query matrix $Q^{(a)}$, a key matrix $K^{(a)}$ and a value matrix $V^{(a)}$, where the attention head index $a=1,\dots,A$, $A$ is the total number of attention heads, and the dimension of each vector in the mapping matrices is $d=D/A$;
the corresponding multi-head spatio-temporal self-attention weight features are computed as $\mathrm{Att}^{(a)}=\mathrm{Softmax}\big(Q^{(a)}(K^{(a)})^{\top}/\sqrt{d}\big)\,V^{(a)}$, where $\mathrm{Softmax}(\cdot)$ is the normalized exponential function;
the attention weight features of all heads, a learnable parameter matrix and a residual structure are used to compute the large-scale spatio-temporal block attention feature matrix, which is decomposed to obtain the updated large-scale spatio-temporal block classification vector and the large-scale spatio-temporal block spatio-temporal feature matrix; here $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron and $\mathrm{LN}(\cdot)$ denotes layer normalization, applied with residual connections;
the same operation yields the small-scale spatio-temporal block attention feature matrices, giving the $r$-th group of spatio-temporal block attention feature tensors for both scales.
(3-5) the scale mutual attention submodule of the $r$-th dual-scale spatio-temporal block mutual attention module $\mathcal{M}_{r}$ takes as input the output of the spatio-temporal attention submodule, in which the $j$-th dual-scale spatio-temporal block classification vectors of the $r$-th group are $c^{j}_{r,l}$ and $c^{j}_{r,s}$ and the dual-scale spatio-temporal block spatio-temporal feature matrices are $E^{j}_{r,l}$ and $E^{j}_{r,s}$;
the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ is linearly mapped to obtain the query vector; the large-scale spatio-temporal block classification vector $c^{j}_{r,l}$ together with the small-scale spatio-temporal block spatio-temporal feature matrix $E^{j}_{r,s}$ is linearly mapped to obtain the key matrix and the value matrix; the multi-head spatio-temporal attention weight features are then computed as in (3-4);
using a learnable parameter matrix and a residual structure, the updated large-scale spatio-temporal block classification vector is obtained; collecting the updated classification vectors of all large-scale spatio-temporal blocks of the $r$-th group and applying a linear mapping gives the updated large-scale classification vector $\hat{z}^{cls}_{r,l}$;
all large-scale spatio-temporal block spatio-temporal feature matrices of the $r$-th group are concatenated into the large-scale spatio-temporal feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix $\hat{Z}_{r,l}$;
the same operation gives the small-scale classification vector $\hat{z}^{cls}_{r,s}$ and the small-scale mutual attention feature matrix $\hat{Z}_{r,s}$; the output of the $r$-th spatio-temporal block mutual attention module is the dual-scale mutual attention feature tensor $\{\hat{Z}_{r,l},\hat{Z}_{r,s},\hat{z}^{cls}_{r,l},\hat{z}^{cls}_{r,s}\}$.
Step (4) a classroom action classification module is constructed, input is a double-scale classification vector, and output is an action class probability vector; the method comprises the following steps:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$ output by the dual-scale spatio-temporal block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector $y_{l}\in\mathbb{R}^{B}$ and the small-scale score vector $y_{s}\in\mathbb{R}^{B}$ over the action categories to which the student action may belong;
(4-2) the action class probability vector $y\in\mathbb{R}^{B}$ is computed from the two score vectors and output.
Step (5) performing iterative training on an action recognition model consisting of a double-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module until the model is converged; the method comprises the following steps:
(5-1) the action recognition model $\mathcal{M}$ is composed of the dual-scale feature embedding module of step (2), the dual-scale spatio-temporal block mutual attention encoder of step (3) and the classroom action classification module of step (4);
(5-2) the input of the action recognition model $\mathcal{M}$ is the student action video frame sequence $V$; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices $X_{l}$ and $X_{s}$, which are fed into the dual-scale spatio-temporal block mutual attention encoder to output the dual-scale classification vectors $\hat{z}^{cls}_{R,l}$ and $\hat{z}^{cls}_{R,s}$; the dual-scale classification vectors are fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is trained iteratively until it converges: the loss function of the action recognition model is set to the cross-entropy loss $\mathcal{L}=-\sum_{b=1}^{B}\hat{y}_{b}\log y_{b}$; the model is optimized with the stochastic gradient descent algorithm and the model parameters are updated by back-propagation of gradients until the loss converges; here $y_{b}$ is the predicted probability that the student action belongs to action category $b$, and $\hat{y}_{b}$ is the ground-truth label: $\hat{y}_{b}=1$ if the action category of the classroom student video is $b$, and $\hat{y}_{b}=0$ otherwise.
Step (6) preprocessing a new classroom student video, inputting a first frame of image into a pre-trained target detection model to obtain a student boundary frame, acquiring a corresponding video frame sequence according to the student boundary frame, inputting the video frame sequence into a trained action recognition model, and finally outputting the class of student action; the method comprises the following steps:
(6-1) feeding the high-definition classroom student image dataset annotated with student position bounding boxes into the open source object detection model YOLOv5 pre-trained on the existing COCO2017 dataset, and iteratively training the model until it converges, giving the object detection model $\mathcal{D}$;
(6-2) for a new classroom student video, obtaining the video frame sequence as in (1-1), and feeding the first frame image into the object detection model $\mathcal{D}$ to obtain the position bounding box of each student; the action video frame sequence of each student $V_{\varphi}=\{f^{\varphi}_{1},\dots,f^{\varphi}_{T}\}$ is then obtained as in (1-2), where $\varphi$ is the student index, $\chi$ is the total number of students, and $f^{\varphi}_{i}\in\mathbb{R}^{H\times W\times 3}$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence of the $\varphi$-th student;
(6-3) feeding the action video frame sequence $V_{\varphi}$ of each student into the action recognition model $\mathcal{M}$ trained in step (5) to obtain the action class probability vector $y_{\varphi}$ of the $\varphi$-th student, and taking the action category $b'$ corresponding to the maximum probability value as the category of that student's action, i.e. $b'=\arg\max(y_{\varphi})$, where $\arg\max(\cdot)$ returns the index of the largest element of a vector.
The embodiment described above is only an example of an implementation of the inventive concept; the protection scope of the present invention should not be regarded as limited to the specific form set forth in the embodiment, and also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (7)

1. The classroom action identification method based on double-scale space-time block mutual attention is characterized in that the method firstly obtains high-definition classroom student video data and then carries out the following operations:
step (1): preprocessing the high-definition classroom student video data to obtain student action video frame sequences;
step (2): constructing a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale spatio-temporal feature representation;
step (3): constructing a spatio-temporal block mutual attention encoder whose input is the dual-scale spatio-temporal feature representation and whose output is a pair of dual-scale classification vectors;
step (4): constructing a classroom action classification module whose input is the dual-scale classification vectors and whose output is an action class probability vector;
step (5): iteratively training the action recognition model composed of the dual-scale feature embedding module, the spatio-temporal block mutual attention encoder and the classroom action classification module until the model converges;
step (6): preprocessing a new classroom student video, feeding the first frame image into a pre-trained object detection model to obtain the student bounding boxes, obtaining the corresponding video frame sequences according to the student bounding boxes, feeding them into the trained action recognition model, and finally outputting the category of each student's action.
2. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 1, wherein the step (1) is specifically as follows:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and annotating the student position bounding boxes in the high-definition classroom student video frames every 60k frames, obtaining a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, cropping the 60k frames of images inside the bounding-box region using the matrix indexing of OpenCV (the open source computer vision library), and scaling height and width to the same resolution, obtaining the student action video frame sequence $V=\{f_1,f_2,\dots,f_T\}$ with action category label $b$, where $f_i\in\mathbb{R}^{H\times W\times 3}$, $\mathbb{R}$ denotes the real number domain, $b=1,\dots,B$, $B$ is the total number of action categories, $f_i$ is the $i$-th RGB three-channel image of height $H$ and width $W$ in the frame sequence, and $T$ is the total number of frames, i.e. $T=60k$.
3. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 2, wherein the step (2) is specifically as follows:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) feeding the student action video frame sequence $V$ into the three-dimensional convolution layer to obtain spatio-temporal features, which are then fed into the three-dimensional average pooling layer to obtain the pooled spatio-temporal features $\tilde{X}\in\mathbb{R}^{h\times w\times c\times t}$, where $h$, $w$, $c$ and $t$ are the height, width, channel and temporal dimensions of the pooled spatio-temporal features, respectively;
(2-3) applying feature blocking operations at scales $L\times L$ and $S\times S$ to the height and width dimensions of the pooled spatio-temporal features $\tilde{X}$, and mapping the features of each block through the linear embedding layer, obtaining for the $p$-th block at time $t$ the large-scale block feature vector $x^{l}_{t,p}\in\mathbb{R}^{D}$ and the small-scale block feature vector $x^{s}_{t,p}\in\mathbb{R}^{D}$, where $D$ is the feature vector dimension, $L$ and $S$ are the block scales, $L=\gamma S$, and $\gamma>0$ is the scale multiple;
the two kinds of block feature vectors are concatenated separately to obtain the large-scale spatio-temporal feature matrix $X_{l}=[x^{l}_{1,1},\dots,x^{l}_{t,N_{l}}]$ and the small-scale spatio-temporal feature matrix $X_{s}=[x^{s}_{1,1},\dots,x^{s}_{t,N_{s}}]$, where $[\cdot,\dots,\cdot]$ denotes the concatenation operation, the total number of large-scale spatial feature blocks is $N_{l}=hw/L^{2}$ and the total number of small-scale spatial feature blocks is $N_{s}=hw/S^{2}$; the output is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$.
4. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 3, wherein the step (3) is specifically as follows:
(3-1) the spatio-temporal block mutual attention encoder is composed of $R$ spatio-temporal block mutual attention modules connected in series, each composed of a spatio-temporal block generation submodule, a spatio-temporal attention submodule and a scale mutual attention submodule; its input is the dual-scale spatio-temporal feature representation $\{X_{l},X_{s}\}$;
(3-2) the r-th space-time block mutual attention module
Figure FDA0003062928360000029
Input dual scale spatiotemporal feature tensor
Figure FDA00030629283600000210
Wherein the input large-scale space-time feature matrix
Figure FDA00030629283600000211
Input small scale space-time feature matrix
Figure FDA00030629283600000212
Figure FDA00030629283600000213
And
Figure FDA00030629283600000214
classifying the large-scale classification vector and the small-scale classification vector;
the r space-time block mutual attention module
Figure FDA00030629283600000215
Output dual-scale mutual attention feature tensor
Figure FDA00030629283600000216
Wherein, the output large-scale mutual attention feature matrix
Figure FDA00030629283600000217
Output small-scale mutual attention feature matrix
Figure FDA00030629283600000218
Figure FDA00030629283600000219
And
Figure FDA00030629283600000220
for the output large scale classification vector and the small scale classification vector,
Figure FDA00030629283600000221
and
Figure FDA00030629283600000222
the large-scale space-time characteristic matrix and the small-scale space-time characteristic matrix are output;
when r = 1, the input large-scale spatio-temporal feature matrix is Z^{1,l} = X^l, the input small-scale spatio-temporal feature matrix is Z^{1,s} = X^s, and the large-scale classification vector z^{1,l}_cls and the small-scale classification vector z^{1,s}_cls are obtained by random initialization;
when R ≥ r > 1, the input dual-scale spatio-temporal feature tensor Z^r is the dual-scale mutual attention feature tensor Y^{r-1} output by the previous space-time block mutual attention module E_{r-1}, i.e. Z^r = Y^{r-1};
the output of the space-time block mutual attention encoder is the pair of dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls output by the R-th space-time block mutual attention module E_R.
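As an illustrative reading of steps (3-1)–(3-2), the sketch below chains R modules and randomly initialises the first classification vectors; module_factory is a hypothetical callable producing one space-time block mutual attention module, and the pass-through stand-in is used only to keep the example runnable.

```python
import torch
import torch.nn as nn

class DualScaleEncoder(nn.Module):
    """R space-time block mutual attention modules in series; the classification vectors
    entering module r are the outputs of module r-1, and the first ones are random."""
    def __init__(self, module_factory, R=4, dim=768):
        super().__init__()
        self.modules_r = nn.ModuleList([module_factory() for _ in range(R)])
        self.cls_l = nn.Parameter(torch.randn(1, dim))   # z_cls^{1,l}, random initialisation
        self.cls_s = nn.Parameter(torch.randn(1, dim))   # z_cls^{1,s}, random initialisation

    def forward(self, x_l, x_s):
        z = (self.cls_l, x_l, self.cls_s, x_s)           # Z^1 = {cls_l, X^l, cls_s, X^s}
        for module_r in self.modules_r:                  # Z^{r+1} = Y^r
            z = module_r(*z)
        cls_l, _, cls_s, _ = z
        return cls_l, cls_s                              # dual-scale classification vectors

class _PassThrough(nn.Module):
    """Trivial stand-in for one mutual attention module, used only to run the example."""
    def forward(self, cls_l, x_l, cls_s, x_s):
        return cls_l, x_l, cls_s, x_s

enc = DualScaleEncoder(_PassThrough, R=4)
cls_l, cls_s = enc(torch.randn(392, 768), torch.randn(1568, 768))
print(cls_l.shape, cls_s.shape)   # torch.Size([1, 768]) torch.Size([1, 768])
```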
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module E_r regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map M^{r,l} and a small-scale feature map M^{r,s} of uniform size, whose height dimension h and width dimension w are determined by the frame size and the block scales;
according to the height dimension h_r, width dimension w_r and time dimension t_r, M^{r,l} is partitioned into space-time blocks to obtain the r-th group of large-scale space-time block feature tensor elements, where j is the index subscript of the large-scale space-time block, Q_r is the total number of large-scale space-time blocks in the r-th group, and Q_r·t_r·h_r·w_r covers the whole T×h×w feature volume;
the size of the space-time blocks in the r-th group is λ times that of the (r-1)-th group, λ > 0, i.e. h_r = λ·h_{r-1}, w_r = λ·w_{r-1}, t_r = λ·t_{r-1};
the j-th large-scale space-time block is then dimension-transformed to obtain its spatio-temporal feature matrix, where the total number of spatial feature blocks of the large-scale space-time block is n_l = h_r·w_r;
the large-scale classification vector and the spatio-temporal feature matrix of the j-th large-scale space-time block are spliced to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group; the same operation is performed to obtain the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of the small-scale space-time block is n_s = h_r·w_r·γ²;
this yields the r-th group of dual-scale space-time block feature tensors for the large scale and the small scale.
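A hedged sketch of the space-time blocking in step (3-3): the helper below regroups a token matrix back into a T×H×W grid and cuts it into t_r×h_r×w_r space-time blocks; the concrete sizes in the example are assumptions, not values fixed by the claim.

```python
import torch

def make_spacetime_blocks(tokens, T, H, W, t_r, h_r, w_r):
    """Regroup a (T*H*W, D) token matrix into space-time blocks of t_r x h_r x w_r tokens.

    Returns (Q, t_r*h_r*w_r, D), where Q = (T//t_r)*(H//h_r)*(W//w_r) plays the role of Q_r."""
    D = tokens.shape[-1]
    z = tokens.reshape(T, H, W, D)
    z = z.reshape(T // t_r, t_r, H // h_r, h_r, W // w_r, w_r, D)
    # move the three block indices to the front and flatten the intra-block positions
    z = z.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, t_r * h_r * w_r, D)
    return z

# Example: 8 frames, a 14x14 token grid, blocks of 2x7x7 tokens -> 16 blocks of 98 tokens
tokens = torch.randn(8 * 14 * 14, 768)
blocks = make_spacetime_blocks(tokens, T=8, H=14, W=14, t_r=2, h_r=7, w_r=7)
print(blocks.shape)   # torch.Size([16, 98, 768])
```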
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module E_r is the output of the space-time block generation submodule, i.e. the r-th group of dual-scale space-time block feature tensors;
the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped at each attention head to obtain the query matrix Q_a, the key matrix K_a and the value matrix V_a, where the attention head index a = 1, …, A, A is the total number of attention heads, and the dimension of each vector in the mapping matrices is D/A;
the corresponding multi-head spatio-temporal self-attention weight features are computed as Attn_a = Softmax(Q_a·K_a^T / sqrt(D/A))·V_a, where Softmax(·) is the normalized exponential function;
using the multi-head self-attention weight features, the learnable parameter matrix and the residual structure, the large-scale space-time block spatio-temporal attention feature matrix is computed, where MLP(·) denotes the multilayer perceptron and LN(·) denotes layer normalization;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block spatio-temporal feature matrix;
the same operation is performed to obtain the small-scale space-time block spatio-temporal attention feature matrix, thereby obtaining the r-th group of dual-scale space-time block spatio-temporal attention feature tensors.
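The space-time attention submodule of step (3-4) behaves like a pre-norm Transformer block applied within each space-time block; the sketch below uses torch.nn.MultiheadAttention and assumes the classification vector is prepended to each block's tokens, which is one plausible realisation rather than the claimed implementation itself.

```python
import torch
import torch.nn as nn

class SpaceTimeSelfAttention(nn.Module):
    """Pre-norm multi-head self-attention + MLP over the tokens of each space-time block."""
    def __init__(self, dim=768, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        # z: (Q, 1 + n, D) -- a classification vector prepended to each block's n tokens
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # multi-head attention + residual
        z = z + self.mlp(self.norm2(z))                      # LN + MLP + residual
        return z

blocks = torch.randn(16, 99, 768)               # 16 blocks, 1 classification token + 98 tokens
print(SpaceTimeSelfAttention()(blocks).shape)   # torch.Size([16, 99, 768])
```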
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module E_r is the output of the space-time attention submodule, namely, for each block index j of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block spatio-temporal feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector spliced with the small-scale space-time block spatio-temporal feature matrix is linearly mapped to obtain the key matrix and the value matrix;
the multi-head spatio-temporal attention weight features are computed, and together with the learnable parameter matrix and the residual structure they yield the updated large-scale space-time block classification vector, thereby obtaining all Q_r updated large-scale space-time block classification vectors of the r-th group; linear mapping is performed on these to obtain the updated large-scale classification vector y^{r,l}_cls;
all large-scale space-time block spatio-temporal feature matrices of the r-th group are spliced to obtain the large-scale spatio-temporal feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix Y^{r,l};
the same operation is performed to obtain the small-scale classification vector y^{r,s}_cls and the small-scale mutual attention feature matrix Y^{r,s};
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor Y^r = {y^{r,l}_cls, Y^{r,l}, y^{r,s}_cls, Y^{r,s}}.
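The scale mutual attention of step (3-5) can be read as a cross-attention in which the large-scale classification vector queries the small-scale tokens; the sketch below is one such reading, with layer choices (pre-norm, output projection) that are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-attention: the large-scale classification vector queries the small-scale tokens."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_l, tokens_s):
        # cls_l: (Q, 1, D) large-scale classification vectors, one per space-time block
        # tokens_s: (Q, n_s, D) small-scale space-time block feature tokens
        q = self.norm_q(cls_l)
        kv = self.norm_kv(tokens_s)
        out = self.attn(q, kv, kv, need_weights=False)[0]   # query = cls, keys/values = tokens
        return cls_l + self.proj(out)                       # residual update of the cls vector

cls_l = torch.randn(16, 1, 768)
tokens_s = torch.randn(16, 392, 768)
print(ScaleMutualAttention()(cls_l, tokens_s).shape)   # torch.Size([16, 1, 768])
```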
5. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 4, wherein the step (4) is specifically as follows:
(4-1) the input of the classroom action classification module is the pair of dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls output by the dual-scale space-time block mutual attention encoder; a multilayer perceptron is used to compute the large-scale score vector and the small-scale score vector of the action category to which the student action belongs;
(4-2) the action class probability vector y is output from the large-scale and small-scale score vectors.
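A minimal sketch of the classification module of claim 5, assuming a single linear layer standing in for each perceptron head, an averaged fusion of the two score vectors, and ten action categories; the fusion rule and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim, num_classes = 768, 10             # the number of classroom action categories is assumed
head_l = nn.Linear(dim, num_classes)   # linear stand-in for the large-scale perceptron head
head_s = nn.Linear(dim, num_classes)   # linear stand-in for the small-scale perceptron head

def classify(cls_l, cls_s):
    # cls_l, cls_s: (D,) final large-/small-scale classification vectors from the encoder
    score_l = head_l(cls_l)            # large-scale score vector
    score_s = head_s(cls_s)            # small-scale score vector
    # fuse the two scales and normalise into an action-class probability vector
    return torch.softmax((score_l + score_s) / 2, dim=-1)

y = classify(torch.randn(dim), torch.randn(dim))
print(y.shape, float(y.sum()))         # torch.Size([10]) ~1.0
```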
6. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 5, wherein the step (5) is specifically as follows:
(5-1) the action recognition model M is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model M is the student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale spatio-temporal feature matrices X^l and X^s, which are input to the dual-scale space-time block mutual attention encoder to output the dual-scale classification vectors y^{R,l}_cls and y^{R,s}_cls; the dual-scale classification vectors are input to the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) the action recognition model is iteratively trained until it converges: the loss function of the action recognition model is set as the cross-entropy loss L = −Σ_b ŷ_b·log(y_b); the action recognition model is optimized with the stochastic gradient descent algorithm, and the model parameters are updated through backward gradient propagation until the loss converges; where y_b is the probability that the student action belongs to action category b, ŷ_b ∈ {0, 1} is the true label, ŷ_b = 1 if the action category of the classroom student video is b, and ŷ_b = 0 otherwise.
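A hedged sketch of the training procedure of step (5-3): cross-entropy loss on the model's probability output, optimised by stochastic gradient descent; model, train_loader, the learning rate and the epoch count are placeholders supplied by the caller, not values fixed by the claim.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, num_epochs=30, lr=0.01):
    """Cross-entropy loss + stochastic gradient descent, as in step (5-3)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(num_epochs):
        for frames, labels in train_loader:    # frames: (B, T, C, H, W); labels: (B,)
            probs = model(frames)              # (B, num_classes) action-class probabilities y
            # cross-entropy with one-hot labels: loss = -sum_b yhat_b * log(y_b)
            loss = F.nll_loss(torch.log(probs + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()                    # back-propagate gradients
            optimizer.step()                   # stochastic gradient descent update
    return model
```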
7. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 6, wherein step (6) is specifically:
(6-1) the high-definition classroom student image dataset annotated with student position bounding boxes is input to a YOLOv5 object detection model pre-trained on the COCO2017 dataset, and the model is trained iteratively until convergence to obtain the trained object detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1), and the first frame image is input to the trained object detection model to obtain the position bounding box of each student; the action video frame sequence V_φ of each student is then obtained as in (1-2), where φ is the student index, χ is the total number of students, and the i-th element of V_φ is an RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) the action video frame sequence V_φ of each student is input to the action recognition model M trained in step (5) to obtain the action class probability vector y_φ of the φ-th student, and the action category b' corresponding to the maximum probability value is taken as the category of that student's action, b' = argmax(y_φ), where argmax(·) returns the index of the largest element in the vector.
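A rough sketch of the inference pipeline of claim 7 (steps 6-2 and 6-3): detect students in the first frame with a YOLOv5 model loaded from torch.hub, crop each student's frame sequence, and take the argmax of the action-class probabilities; the hub call, the assumption that frames are numpy HWC images, and the action_model interface (standing in for the preprocessing plus trained model of step (5)) are assumptions of the sketch.

```python
import torch

def recognise_students(frames, action_model, detector=None):
    """Detect each student in the first frame, crop their frame sequence, classify the action."""
    if detector is None:
        # pretrained YOLOv5 from torch.hub; fine-tuning on classroom data is done separately
        detector = torch.hub.load('ultralytics/yolov5', 'yolov5s')
    boxes = detector(frames[0]).xyxy[0]        # (num_students, 6): x1, y1, x2, y2, conf, cls
    predictions = []
    for x1, y1, x2, y2, *_ in boxes.tolist():
        # crop the same bounding box from every frame to form this student's frame sequence
        crops = [f[int(y1):int(y2), int(x1):int(x2)] for f in frames]
        probs = action_model(crops)            # action-class probability vector y_phi
        predictions.append(int(torch.as_tensor(probs).argmax()))   # b' = argmax(y_phi)
    return predictions
```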
CN202110518525.4A 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention Active CN113408343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110518525.4A CN113408343B (en) 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention


Publications (2)

Publication Number Publication Date
CN113408343A true CN113408343A (en) 2021-09-17
CN113408343B CN113408343B (en) 2022-05-13

Family

ID=77678584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110518525.4A Active CN113408343B (en) 2021-05-12 2021-05-12 Classroom action recognition method based on double-scale space-time block mutual attention

Country Status (1)

Country Link
CN (1) CN113408343B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109902293A (en) * 2019-01-30 2019-06-18 华南理工大学 A kind of file classification method based on part with global mutually attention mechanism
CN111027377A (en) * 2019-10-30 2020-04-17 杭州电子科技大学 Double-flow neural network time sequence action positioning method
CN111611847A (en) * 2020-04-01 2020-09-01 杭州电子科技大学 Video motion detection method based on scale attention hole convolution network
CN112183269A (en) * 2020-09-18 2021-01-05 哈尔滨工业大学(深圳) Target detection method and system suitable for intelligent video monitoring

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang, Jieran: "Research on Video Action Recognition Methods Based on High-Low Level Feature Fusion and Convolutional Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *
Tian, Shuai: "Research on Sequence Data Classification Models Based on Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887610A (en) * 2021-09-29 2022-01-04 内蒙古工业大学 Pollen image classification method based on cross attention distillation transducer
CN113887610B (en) * 2021-09-29 2024-02-02 内蒙古工业大学 Pollen image classification method based on cross-attention distillation transducer
CN114373224A (en) * 2021-12-28 2022-04-19 华南理工大学 Fuzzy 3D skeleton action identification method and device based on self-supervision learning
CN114648722A (en) * 2022-04-07 2022-06-21 杭州电子科技大学 Action identification method based on video multipath space-time characteristic network
CN114648722B (en) * 2022-04-07 2023-07-18 杭州电子科技大学 Motion recognition method based on video multipath space-time characteristic network
CN115273182A (en) * 2022-07-13 2022-11-01 苏州工业职业技术学院 Long video concentration degree prediction method and device
CN115273182B (en) * 2022-07-13 2023-07-11 苏州工业职业技术学院 Long video concentration prediction method and device
CN117292209A (en) * 2023-11-27 2023-12-26 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN117292209B (en) * 2023-11-27 2024-04-05 之江实验室 Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization
CN118230259A (en) * 2024-05-24 2024-06-21 辽宁人人畅享科技有限公司 Practice teaching management system based on internet of things technology
CN118230259B (en) * 2024-05-24 2024-07-16 辽宁人人畅享科技有限公司 Practice teaching management system based on internet of things technology

Also Published As

Publication number Publication date
CN113408343B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN113408343B (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN111709409B (en) Face living body detection method, device, equipment and medium
US11810366B1 (en) Joint modeling method and apparatus for enhancing local features of pedestrians
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN112036447B (en) Zero-sample target detection system and learnable semantic and fixed semantic fusion method
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN113033276B (en) Behavior recognition method based on conversion module
CN111507275B (en) Video data time sequence information extraction method and device based on deep learning
CN111738355A (en) Image classification method and device with attention fused with mutual information and storage medium
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Zhang et al. Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention
CN114187506B (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113536926A (en) Human body action recognition method based on distance vector and multi-angle self-adaptive network
Zhang et al. Skeleton-based action recognition with attention and temporal graph convolutional network
CN112861848B (en) Visual relation detection method and system based on known action conditions
CN115496991A (en) Reference expression understanding method based on multi-scale cross-modal feature fusion
CN109726690B (en) Multi-region description method for learner behavior image based on DenseCap network
CN117557857B (en) Detection network light weight method combining progressive guided distillation and structural reconstruction
CN117333799B (en) Middle and primary school classroom behavior detection method and device based on deformable anchor frame
CN117372837A (en) Cross-modal knowledge migration method based on knowledge distillation and unsupervised training modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant