CN113408343B - Classroom action recognition method based on double-scale space-time block mutual attention - Google Patents


Info

Publication number
CN113408343B
CN113408343B (granted publication) · CN202110518525A (application)
Authority
CN
China
Prior art keywords: scale, space, time, student, feature
Prior art date
Legal status: Active
Application number
CN202110518525.4A
Other languages: Chinese (zh)
Other versions: CN113408343A (en)
Inventors: 李平 (Li Ping), 陈嘉 (Chen Jia), 曹佳晨 (Cao Jiachen), 徐向华 (Xu Xianghua)
Current Assignee: Hangzhou Dianzi University
Original Assignee: Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202110518525.4A
Publication of CN113408343A
Application granted
Publication of CN113408343B

Classifications

    • G06F18/2415 — Pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06Q50/205 — ICT specially adapted for education; education administration or guidance


Abstract

The invention discloses a classroom action recognition method based on dual-scale space-time block mutual attention. First, high-definition classroom student video data are preprocessed to obtain student action video frame sequences. An action recognition model consisting of a dual-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module is then constructed; it successively produces a dual-scale space-time feature representation, a dual-scale classification vector and an action-category probability vector, and is iteratively optimized with a stochastic gradient descent algorithm. Finally, a new, preprocessed classroom video is fed into the model to obtain the category of each student action. The method not only models several groups of space-time blocks with space-time attention to capture multi-scale space-time information of offline and online classroom student videos, but also characterizes student picture information at different scales through a scale mutual attention mechanism, thereby improving the accuracy of student action recognition in classroom videos.

Description

Classroom action recognition method based on double-scale space-time block mutual attention
Technical Field
The invention belongs to the technical field of video understanding and analysis, in particular to action recognition in video analysis, and relates to a classroom action recognition method based on dual-scale space-time block mutual attention.
Background
The traditional offline classroom is the main place where students learn and teachers teach. In recent years, and especially during the epidemic, online classes delivered by live streaming or pre-recorded lectures have become a popular mode for teachers and students. Whether a class is held offline in a classroom or online on a network platform, teaching quality directly affects students' learning outcomes. A common dilemma in practice is that teachers must spend considerable energy on classroom discipline management to guarantee teaching quality and therefore cannot devote their full attention to teaching; this is especially evident in primary school classrooms. Video action recognition technology is therefore introduced to recognize the actions of students in class, perceive their learning state in real time, and provide an intelligent analysis report reflecting classroom quality. The classroom action recognition task takes a student action video frame sequence as input and outputs the student action category; it has wide application in scenarios such as classroom teaching, self-study management and unmanned exam proctoring. For example, in an unmanned proctoring environment, a classroom action recognition method can recognize examinees' actions in real time, so that an examinee showing a suspected cheating action can be investigated, which safeguards examination discipline. The main challenges are: it is difficult to unify offline and online classroom action recognition methods; students appear at different distances, and hence at different scales, in the same video frame; and recognizing the actions of many students requires a large amount of computation.
At present, there are few practical applications of action recognition in classroom scenes, and existing methods are mainly based on wearable devices or skeleton information. However, wearable devices can make students uncomfortable and thus reduce their learning efficiency, while skeleton-based methods can recognize only a limited number of action types and their performance is easily degraded by occlusion from desks, chairs, books and other objects. In addition, traditional action recognition methods encode video frames into hand-crafted features (such as HOG3D and 3D SURF), but such features have strong limitations and are slow to extract, so real-time requirements cannot be met. In recent years, action recognition methods built around convolutional neural networks (CNN) can learn, end to end, feature representations that reflect the latent semantic information of videos, greatly improving recognition accuracy. To extract more effective visual features, the residual network (ResNet) uses residual connections between different layers, alleviating over-fitting, vanishing gradients and exploding gradients when training deeper neural networks; the Non-Local Network captures long-range dependencies with non-local operations, establishing connections between pixel blocks at different distances in a video frame through an attention mechanism and mining the semantic information between them. Moreover, the Transformer model, which originated in natural language processing, has recently been favored in computer vision: multi-head attention extracts diverse, critical temporal information from a video frame sequence, so that the model can learn more discriminative feature representations.
The existing classroom action recognition techniques still have many shortcomings. First, models are designed separately for offline or online classrooms, and a unified interface that fuses the two kinds of classroom action recognition is lacking. Second, computing block-wise space-time attention over all video frames at every feature-extraction step ignores the local nature of space-time features, which lowers the recognition rate, and the computational cost becomes excessive at high video resolutions. In addition, many methods extract space-time features at only a single block scale and can hardly adapt to student pictures of different scales within the same frame. To address the lack of a local space-time feature information exchange mechanism and the need to adapt to student pictures of different scales, an efficient classroom action recognition method that unifies offline and online classrooms and improves student action recognition accuracy is urgently needed.
Disclosure of Invention
The aim of the invention is to provide, in view of the shortcomings of the prior art, a classroom action recognition method based on dual-scale space-time block mutual attention, in which several groups of space-time blocks are modeled with space-time attention to capture multi-scale space-time information of offline and online classroom student videos, and scale mutual attention is used to characterize student picture information at different scales, thereby improving the classroom action recognition rate.
The method firstly acquires high-definition classroom student video data, and then sequentially performs the following operations:
Step (1): preprocess the high-definition classroom student video data to obtain student action video frame sequences; the method comprises the following steps:
(1-1) process each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and label the student position bounding boxes in the high-definition classroom student video frames at a time interval of 60k frames to obtain a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, crop the 60k frames of images inside the bounding box region using the OpenCV (open source computer vision library) matrix indexing method, and scale their height and width to the same resolution, obtaining a student action video frame sequence V = {f_1, f_2, …, f_T}, f_i ∈ R^(H×W×3), where R denotes the real number domain, the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th RGB three-channel image of height H and width W in the frame sequence, and T is the total number of frames, i.e. T = 60k.
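As an illustration of step (1), a minimal preprocessing sketch is given below, assuming OpenCV and NumPy; the sampling rate k, the 224×224 target resolution and all function names are illustrative choices, not values fixed by the method:

```python
import cv2
import numpy as np

def sample_frames(video_path, k=25):
    """Decode a classroom video and resample it to roughly k frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or k
    step = max(int(round(native_fps / k)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

def crop_student_clip(frames, box, size=(224, 224)):
    """Crop one student's bounding box from every frame and rescale to a common resolution."""
    x1, y1, x2, y2 = box
    clip = [cv2.resize(f[y1:y2, x1:x2], size) for f in frames]
    return np.stack(clip)   # shape (T, H, W, 3): the frame sequence V for one student
```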
Step (2): construct a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) input the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feed these into the three-dimensional average pooling layer to obtain the pooled space-time features, a tensor of size h×w×c×t, where h, w, c and t are respectively the height, width, channel and temporal dimensions of the pooled space-time features;
(2-3) apply the feature blocking operation to the height and width dimensions of the pooled space-time features at the L×L and S×S scales respectively, and map the features of each block through the linear embedding layer to obtain the large-scale block feature vector x_{t,p}^l ∈ R^D and the small-scale block feature vector x_{t,p}^s ∈ R^D of the p-th block at time t, where D is the feature vector dimension, L and S are the block scales, L = γS, and γ > 0 is the scale multiple; concatenate the two kinds of block feature vectors respectively to obtain the large-scale space-time feature matrix X_l and the small-scale space-time feature matrix X_s, where [·,…,·] denotes the concatenation operation, the total number of large-scale spatial feature blocks is hw/L^2 and the total number of small-scale spatial feature blocks is hw/S^2; output the dual-scale space-time feature representation {X_l, X_s}.
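A minimal PyTorch sketch of the dual-scale feature embedding module of step (2). The layer hyper-parameters (kernel sizes, D, L, S) are illustrative assumptions, and the block-plus-linear-embedding step is realised here as strided 3D convolutions, which is equivalent but not necessarily the exact implementation of the method:

```python
import torch
import torch.nn as nn

class DualScaleEmbedding(nn.Module):
    def __init__(self, dim=256, large=8, small=4):   # L = gamma * S with gamma = 2 (assumed)
        super().__init__()
        self.conv3d = nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1)
        self.pool = nn.AvgPool3d(kernel_size=(2, 2, 2))
        # L x L and S x S feature blocking + linear embedding, as strided convolutions
        self.embed_l = nn.Conv3d(64, dim, kernel_size=(1, large, large), stride=(1, large, large))
        self.embed_s = nn.Conv3d(64, dim, kernel_size=(1, small, small), stride=(1, small, small))

    def forward(self, video):                  # video: (batch, 3, T, H, W)
        x = self.pool(self.conv3d(video))      # pooled space-time features (batch, 64, t, h, w)
        xl = self.embed_l(x).flatten(2).transpose(1, 2)   # large-scale tokens X_l: (batch, n_l*t, D)
        xs = self.embed_s(x).flatten(2).transpose(1, 2)   # small-scale tokens X_s: (batch, n_s*t, D)
        return xl, xs
```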
Step (3): construct a space-time block mutual attention encoder whose input is the dual-scale space-time feature representation and whose output is a dual-scale classification vector; the method comprises the following steps:
(3-1) the space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series, each consisting of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X_l, X_s};
(3-2) the r-th space-time block mutual attention module takes as input the dual-scale space-time feature tensor consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the large-scale classification vector z^{r,l} and small-scale classification vector z^{r,s};
the r-th space-time block mutual attention module outputs the dual-scale mutual attention feature tensor, in which the output large-scale mutual attention feature matrix is the concatenation of the output large-scale classification vector and the output large-scale space-time feature matrix, and the output small-scale mutual attention feature matrix is likewise the concatenation of the output small-scale classification vector and the output small-scale space-time feature matrix;
when r = 1, the input large-scale and small-scale space-time feature matrices are the dual-scale space-time feature matrices X_l and X_s output by the dual-scale feature embedding module, and the large-scale classification vector z^{1,l} and the small-scale classification vector z^{1,s} are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module;
the output of the space-time block mutual attention encoder is the dual-scale classification vector contained in the output of the R-th space-time block mutual attention module.
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input matrices Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, each with its height dimension and width dimension; according to a height dimension h_r, a width dimension w_r and a time dimension t_r, the large-scale feature map is partitioned into space-time blocks, yielding the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of large-scale space-time blocks in the r-th group, determined by h_r, w_r and t_r; the space-time blocks of the r-th group are λ times the size of those of the (r-1)-th group, λ > 0, i.e. h_r = λh_{r-1}, w_r = λw_{r-1}, t_r = λt_{r-1} for r ≥ 2;
each large-scale space-time block then undergoes a dimension transformation to give its space-time feature matrix, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r; the large-scale classification vector is concatenated with this space-time feature matrix to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group; the same operation yields the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ^2; this gives the r-th group of dual-scale space-time block feature tensors (a sketch of the block generation follows below).
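The sketch below illustrates the space-time block generation of step (3-3). It is a hedged illustration: the token ordering (time-major, then height, then width) and the way the classification vector is broadcast to every block are assumptions consistent with the description above, not details fixed by the patent:

```python
import torch

def make_st_blocks(tokens, t, h, w, tr, hr, wr, cls_vec):
    """Regroup a token sequence (batch, t*h*w, D) into space-time blocks of size tr x hr x wr
    and prepend the (broadcast) classification vector to every block."""
    B, _, D = tokens.shape
    x = tokens.reshape(B, t, h, w, D)
    # split the time/height/width axes into (block index, within-block) pairs
    x = x.reshape(B, t // tr, tr, h // hr, hr, w // wr, wr, D)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)            # (batch, Qt, Qh, Qw, tr, hr, wr, D)
    blocks = x.reshape(B, -1, tr * hr * wr, D)        # (batch, Q_r, tr*hr*wr, D)
    cls = cls_vec.reshape(B, 1, 1, D).expand(B, blocks.shape[1], 1, D)
    return torch.cat([cls, blocks], dim=2)            # classification vector + block feature tokens
```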
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule;
the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain, at each attention head, its query matrix Q_a, key matrix K_a and value matrix V_a, where the attention head index a = 1, …, A, A is the total number of attention heads, and d_a is the dimension of each vector of the mapping matrices; the corresponding multi-head space-time self-attention weight features are computed as Softmax(Q_a·K_a^T/√d_a)·V_a, where Softmax(·) is the normalized exponential function;
using the multi-head self-attention features, learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization; this matrix is decomposed into the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix;
the same operation yields the small-scale space-time block space-time attention feature matrix, thereby giving the r-th group of space-time block space-time attention feature tensors.
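A minimal PyTorch sketch of the per-block space-time attention of step (3-4), using a standard pre-norm transformer layout (multi-head self-attention with a residual connection, then LN and an MLP). The exact layer arrangement inside the patented module is not fully specified above, so this is an assumed but typical realisation:

```python
import torch.nn as nn

class BlockSelfAttention(nn.Module):
    """Self-attention inside one space-time block: row 0 of `block` is the block
    classification vector, the remaining rows are the block feature tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, block):                 # block: (num_blocks, 1 + n, D)
        h = self.norm1(block)
        attn_out, _ = self.attn(h, h, h)      # Softmax(Q K^T / sqrt(d_a)) V per head
        x = block + attn_out                  # residual structure
        x = x + self.mlp(self.norm2(x))       # MLP with layer normalization LN
        return x                              # row 0: updated block classification vector
```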
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, namely, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain its query vector; the small-scale space-time block space-time attention feature matrix is linearly mapped twice to obtain its key matrix and value matrix; the multi-head space-time self-attention weight features are then computed with the Softmax function as in (3-4);
using these features, learnable parameters and a residual structure, the updated large-scale space-time block classification vector is obtained, and hence all large-scale space-time block classification vectors of the r-th group; these are linearly mapped to obtain the updated large-scale classification vector; all large-scale space-time block space-time feature matrices of the r-th group are concatenated to obtain the large-scale space-time feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operation yields the small-scale classification vector and the small-scale mutual attention feature matrix; the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
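The scale mutual attention of step (3-5) is essentially a cross-attention in which a large-scale block classification vector queries small-scale block tokens, and symmetrically for the small scale. A hedged sketch follows; the one-to-one pairing between large-scale and small-scale blocks is an illustrative assumption:

```python
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.l_from_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_from_l = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_l, feat_s, cls_s, feat_l):
        # cls_*: (num_blocks, 1, D) block classification vectors
        # feat_*: (num_blocks, n, D) block space-time feature tokens
        upd_l, _ = self.l_from_s(cls_l, feat_s, feat_s)  # large-scale CLS attends to small-scale tokens
        upd_s, _ = self.s_from_l(cls_s, feat_l, feat_l)  # small-scale CLS attends to large-scale tokens
        return cls_l + upd_l, cls_s + upd_s              # residual update of the classification vectors
```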
Step (4): construct a classroom action classification module whose input is the dual-scale classification vector and whose output is an action-category probability vector;
Step (5): iteratively train the action recognition model consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module until the model converges;
Step (6): preprocess a new classroom student video, input its first frame image into a pre-trained target detection model to obtain the student bounding boxes, obtain the corresponding video frame sequences from the student bounding boxes, input them into the trained action recognition model, and finally output the category of each student's action.
Further, the step (4) is specifically:
(4-1) the input of the classroom action classification module is the dual-scale classification vector output by the dual-scale space-time block mutual attention encoder; a multilayer perceptron is used to compute, from the large-scale and small-scale classification vectors respectively, a large-scale score vector and a small-scale score vector over the action categories to which the student action may belong;
(4-2) the action-category probability vector y ∈ R^B is output from the two score vectors.
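A sketch of the classroom action classification module of step (4). How the two score vectors are fused into one probability vector is not spelled out above; summing the scores before the Softmax is used here purely as an assumed example:

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, dim=256, num_classes=10):       # num_classes plays the role of B
        super().__init__()
        self.head_l = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_classes))
        self.head_s = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_classes))

    def forward(self, cls_l, cls_s):                    # dual-scale classification vectors, (batch, D)
        score_l = self.head_l(cls_l)                    # large-scale score vector
        score_s = self.head_s(cls_s)                    # small-scale score vector
        return torch.softmax(score_l + score_s, dim=-1)  # action-category probability vector y
```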
Further, the step (5) is specifically:
(5-1) the action recognition model consists of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X_l and X_s, which are fed into the dual-scale space-time block mutual attention encoder to output the dual-scale classification vector; the dual-scale classification vector is fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) iteratively train the action recognition model until it converges: set the loss function of the action recognition model to the cross-entropy loss L = -∑_{b=1}^{B} ŷ_b·log(y_b), optimize the model with the stochastic gradient descent algorithm, and update the model parameters by back-propagating gradients until the loss converges; here y_b is the probability that the student action belongs to action category b, and ŷ_b is the ground-truth label: ŷ_b = 1 if the action category of the classroom student video is b, and ŷ_b = 0 otherwise.
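Step (5) amounts to a standard supervised training loop with cross-entropy loss and stochastic gradient descent. A minimal sketch, where `model`, `loader` and the optimizer hyper-parameters are placeholders and the model is assumed to return unnormalized class scores:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=0.01):
    criterion = nn.CrossEntropyLoss()        # cross-entropy: -sum_b y_hat_b * log(y_b)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for epoch in range(epochs):
        for clips, labels in loader:         # clips: (batch, 3, T, H, W), labels: (batch,)
            scores = model(clips)            # unnormalized action-category scores
            loss = criterion(scores, labels)
            optimizer.zero_grad()
            loss.backward()                  # back-propagate gradients
            optimizer.step()                 # stochastic gradient descent update
```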
Still further, the step (6) is specifically:
(6-1) input the high-definition classroom student image dataset labeled with student position bounding boxes into a target detection model YOLOv5 pre-trained on the COCO2017 dataset, and iteratively train it until convergence to obtain the target detection model;
(6-2) for a new classroom student video, obtain its video frame sequence as in (1-1) and input the first frame image into the target detection model to obtain the position bounding box of each student; then obtain each student's action video frame sequence V_φ = {f_1^φ, f_2^φ, …, f_T^φ} as in (1-2), where φ is the student index, χ is the total number of students, and f_i^φ is the i-th RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) input each student's action video frame sequence V_φ into the action recognition model trained in step (5) to obtain the action-category probability vector y^φ of the φ-th student, and take the action category b' corresponding to the maximum probability as the category of that student's action, i.e. b' = argmax(y^φ), where argmax(·) is the index of the largest element of a vector.
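Putting step (6) together: detect students on the first frame, cut out each student's clip, and classify it. A hedged sketch in which the detector and crop function are passed in abstractly, since the exact interfaces are not specified here:

```python
import numpy as np

def recognise_students(frames, detector, action_model, crop_fn):
    """frames: list of RGB frames of one classroom video.
    detector(frame) is assumed to return a list of (x1, y1, x2, y2) student boxes."""
    boxes = detector(frames[0])                    # student bounding boxes from the first frame
    predictions = []
    for box in boxes:
        clip = crop_fn(frames, box)                # (T, H, W, 3) student action frame sequence
        probs = action_model(clip)                 # action-category probability vector y_phi
        predictions.append(int(np.argmax(probs)))  # b' = argmax(y_phi)
    return predictions
```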
The method of the invention uses a dual-scale space-time block mutual attention encoder to recognize student actions in student videos and has the following characteristics: 1) unlike existing methods designed only for offline or only for online classes, the method first uses a target detection model to obtain each student's action frame sequence and then recognizes each student's action category, so it can be used in both offline and online classroom scenarios; 2) unlike existing methods that compute space-time attention over all video frame blocks at every feature-extraction step, the method uses a space-time block generation submodule and a space-time attention submodule to extract space-time features within several groups of space-time blocks, realizing local space-time feature information exchange and greatly reducing the computational overhead; 3) the method partitions video frames into blocks at two different sizes and, combined with the scale mutual attention submodule, better extracts the action information of student pictures of different scales in the video.
The method is suitable for action recognition in complex classroom scenes with many students and student pictures of different scales, and has the following advantages: 1) it unifies offline and online classroom action recognition, reducing the technical cost of applying action recognition to both kinds of classes; 2) it extracts features from several different space-time regions through the space-time block generation submodule and the space-time attention submodule, fully exploiting the local nature of space-time features to obtain more accurate recognition and better computational efficiency; 3) the scale mutual attention submodule learns from student pictures of different scales and fully fuses the space-time features of the two block scales to achieve better recognition performance. The invention has the capability of local space-time feature learning and of capturing the spatial characteristics of student pictures at different scales, and can improve the student action recognition rate in practical application scenarios such as classroom teaching supervision, self-study management and unmanned exam proctoring.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A classroom action recognition method based on dual-scale space-time block mutual attention first samples a classroom student video to obtain its video frame sequence, uses a target detection model to obtain the bounding box of each student's position, and crops the frame images inside each bounding box to obtain the student action video frame sequences; it then constructs an action recognition model consisting of a dual-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module, and finally uses the action recognition model to decide the category of each student's action. The method uses a target detection model to obtain student action frame sequences before recognition, so it can be used in both offline and online classes; it uses the space-time block generation submodule and the space-time attention submodule to extract space-time features within several groups of space-time blocks, realizing local space-time feature information exchange; and it uses two block scales together with the scale mutual attention submodule to capture action information at different scales, adapting to student pictures of different sizes. A classroom action recognition system built in this way can be deployed uniformly in both kinds of classes, while effectively extracting the space-time information of student action video frames and efficiently recognizing student action categories.
As shown in fig. 1, the method first obtains high definition classroom student video data, and then performs the following operations:
Step (1): preprocess the high-definition classroom student video data to obtain student action video frame sequences; the method comprises the following steps:
(1-1) process each online or offline high-definition classroom student video into a corresponding video frame sequence at a sampling rate of 25 frames per second, and label the student position bounding boxes in the high-definition classroom student video frames at a time interval of 1500 frames (one minute) to obtain a high-definition classroom student image dataset;
(1-2) for each student position bounding box, crop the 1500 frames of images inside the bounding box region using the OpenCV (open source computer vision library) matrix indexing method, and scale their height and width to the same resolution, obtaining a student action video frame sequence V = {f_1, f_2, …, f_T}, f_i ∈ R^(H×W×3), where R denotes the real number domain, the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th RGB three-channel image of height H and width W in the frame sequence, and T is the total number of frames, i.e. T = 1500.
Step (2): construct a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) input the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feed these into the three-dimensional average pooling layer to obtain the pooled space-time features, a tensor of size h×w×c×t, where h, w, c and t are respectively the height, width, channel and temporal dimensions of the pooled space-time features;
(2-3) apply the feature blocking operation to the height and width dimensions of the pooled space-time features at the L×L and S×S scales respectively, and map the features of each block through the linear embedding layer to obtain the large-scale block feature vector x_{t,p}^l ∈ R^D and the small-scale block feature vector x_{t,p}^s ∈ R^D of the p-th block at time t, where D is the feature vector dimension, L and S are the block scales, L = γS, and γ > 0 is the scale multiple; concatenate the two kinds of block feature vectors respectively to obtain the large-scale space-time feature matrix X_l and the small-scale space-time feature matrix X_s, where [·,…,·] denotes the concatenation operation, the total number of large-scale spatial feature blocks is hw/L^2 and the total number of small-scale spatial feature blocks is hw/S^2; output the dual-scale space-time feature representation {X_l, X_s}.
Step (3): construct a space-time block mutual attention encoder whose input is the dual-scale space-time feature representation and whose output is a dual-scale classification vector; the method comprises the following steps:
(3-1) the space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series, each consisting of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X_l, X_s};
(3-2) the r-th space-time block mutual attention module takes as input the dual-scale space-time feature tensor consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the large-scale classification vector z^{r,l} and small-scale classification vector z^{r,s};
the r-th space-time block mutual attention module outputs the dual-scale mutual attention feature tensor, in which the output large-scale mutual attention feature matrix is the concatenation of the output large-scale classification vector and the output large-scale space-time feature matrix, and the output small-scale mutual attention feature matrix is likewise the concatenation of the output small-scale classification vector and the output small-scale space-time feature matrix;
when r = 1, the input large-scale and small-scale space-time feature matrices are the dual-scale space-time feature matrices X_l and X_s output by the dual-scale feature embedding module, and the large-scale classification vector z^{1,l} and the small-scale classification vector z^{1,s} are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module;
the output of the space-time block mutual attention encoder is the dual-scale classification vector contained in the output of the R-th space-time block mutual attention module.
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input matrices Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, each with its height dimension and width dimension; according to a height dimension h_r, a width dimension w_r and a time dimension t_r, the large-scale feature map is partitioned into space-time blocks, yielding the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of large-scale space-time blocks in the r-th group, determined by h_r, w_r and t_r; the space-time blocks of the r-th group are λ times the size of those of the (r-1)-th group, λ > 0, i.e. h_r = λh_{r-1}, w_r = λw_{r-1}, t_r = λt_{r-1} for r ≥ 2;
each large-scale space-time block then undergoes a dimension transformation to give its space-time feature matrix, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r; the large-scale classification vector is concatenated with this space-time feature matrix to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group; the same operation yields the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ^2; this gives the r-th group of dual-scale space-time block feature tensors.
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule;
the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain, at each attention head, its query matrix Q_a, key matrix K_a and value matrix V_a, where the attention head index a = 1, …, A, A is the total number of attention heads, and d_a is the dimension of each vector of the mapping matrices; the corresponding multi-head space-time self-attention weight features are computed as Softmax(Q_a·K_a^T/√d_a)·V_a, where Softmax(·) is the normalized exponential function;
using the multi-head self-attention features, learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization; this matrix is decomposed into the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix;
the same operation yields the small-scale space-time block space-time attention feature matrix, thereby giving the r-th group of space-time block space-time attention feature tensors.
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, namely, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain its query vector; the small-scale space-time block space-time attention feature matrix is linearly mapped twice to obtain its key matrix and value matrix; the multi-head space-time self-attention weight features are then computed with the Softmax function as in (3-4);
using these features, learnable parameters and a residual structure, the updated large-scale space-time block classification vector is obtained, and hence all large-scale space-time block classification vectors of the r-th group; these are linearly mapped to obtain the updated large-scale classification vector; all large-scale space-time block space-time feature matrices of the r-th group are concatenated to obtain the large-scale space-time feature matrix, which is concatenated with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operation yields the small-scale classification vector and the small-scale mutual attention feature matrix; the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
Step (4): construct a classroom action classification module whose input is the dual-scale classification vector and whose output is an action-category probability vector; the method comprises the following steps:
(4-1) the input of the classroom action classification module is the dual-scale classification vector output by the dual-scale space-time block mutual attention encoder; a multilayer perceptron is used to compute, from the large-scale and small-scale classification vectors respectively, a large-scale score vector and a small-scale score vector over the action categories to which the student action may belong;
(4-2) the action-category probability vector y ∈ R^B is output from the two score vectors.
Step (5): iteratively train the action recognition model consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module until the model converges; the method comprises the following steps:
(5-1) the action recognition model consists of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X_l and X_s, which are fed into the dual-scale space-time block mutual attention encoder to output the dual-scale classification vector; the dual-scale classification vector is fed into the action classification module, which outputs the probability vector of the action category to which the student action belongs;
(5-3) iteratively train the action recognition model until it converges: set the loss function of the action recognition model to the cross-entropy loss L = -∑_{b=1}^{B} ŷ_b·log(y_b), optimize the model with the stochastic gradient descent algorithm, and update the model parameters by back-propagating gradients until the loss converges; here y_b is the probability that the student action belongs to action category b, and ŷ_b is the ground-truth label: ŷ_b = 1 if the action category of the classroom student video is b, and ŷ_b = 0 otherwise.
Step (6): preprocess a new classroom student video, input its first frame image into the pre-trained target detection model to obtain the student bounding boxes, obtain the corresponding video frame sequences from the student bounding boxes, input them into the trained action recognition model, and finally output the category of each student's action; the method comprises the following steps:
(6-1) input the high-definition classroom student image dataset labeled with student position bounding boxes into the open-source target detection model YOLOv5 pre-trained on the existing COCO2017 dataset, and iteratively train it until convergence to obtain the target detection model;
(6-2) for a new classroom student video, obtain its video frame sequence as in (1-1) and input the first frame image into the target detection model to obtain the position bounding box of each student; then obtain each student's action video frame sequence V_φ = {f_1^φ, f_2^φ, …, f_T^φ} as in (1-2), where φ is the student index, χ is the total number of students, and f_i^φ is the i-th RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) input each student's action video frame sequence V_φ into the action recognition model trained in step (5) to obtain the action-category probability vector y^φ of the φ-th student, and take the action category b' corresponding to the maximum probability as the category of that student's action, i.e. b' = argmax(y^φ), where argmax(·) is the index of the largest element of a vector.
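For the embodiment's detector in (6-1) and (6-2), a COCO2017-pretrained YOLOv5 can be loaded through torch.hub and then fine-tuned on the annotated classroom images (fine-tuning itself would follow the YOLOv5 project's own training workflow and is not shown). The snippet below only illustrates loading the pretrained model and reading out person-class boxes from the first frame, assuming the ultralytics/yolov5 hub interface:

```python
import torch

# COCO2017-pretrained YOLOv5 from the ultralytics hub (small variant chosen for illustration)
detector = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

def detect_students(first_frame, conf_thres=0.4):
    """Return (x1, y1, x2, y2) boxes for detections of the COCO 'person' class (index 0)."""
    results = detector(first_frame)        # first_frame: RGB numpy array
    det = results.xyxy[0]                  # tensor of rows [x1, y1, x2, y2, conf, cls]
    boxes = det[(det[:, 5] == 0) & (det[:, 4] > conf_thres), :4]
    return [tuple(map(int, b)) for b in boxes.tolist()]
```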
The embodiment described here is only an example of one implementation of the inventive concept; the scope of protection of the invention should not be regarded as limited to the specific form set forth in the embodiment, but also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (4)

1. The classroom action recognition method based on double-scale space-time block mutual attention is characterized in that the method first obtains high-definition classroom student video data and then carries out the following operations:
Step (1): preprocess the high-definition classroom student video data to obtain student action video frame sequences; the method comprises the following steps:
(1-1) process each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and label the student position bounding boxes in the high-definition classroom student video frames at a time interval of 60k frames to obtain a high-definition classroom student image dataset, where k = 15-30;
(1-2) for each student position bounding box, crop the 60k frames of images inside the bounding box region using the OpenCV (open source computer vision library) matrix indexing method, and scale their height and width to the same resolution, obtaining a student action video frame sequence V = {f_1, f_2, …, f_T}, f_i ∈ R^(H×W×3), where R denotes the real number domain, the action category label is b, b = 1, …, B, B is the total number of action categories, f_i is the i-th RGB three-channel image of height H and width W in the frame sequence, and T is the total number of frames, i.e. T = 60k;
constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time characteristics, and then putting the space-time characteristics into the three-dimensional average pooling layer to obtain pooled space-time characteristics
Figure FDA0003577556460000013
Wherein h, w, c and t are respectively height dimension, width dimension, channel dimension and time sequence dimension of the pooled space-time characteristics;
(2-3) features of pooled spatio-temporal
Figure FDA0003577556460000014
The height dimension and the width dimension of the block are respectively subjected to feature blocking operation in L multiplied by L and S multiplied by S scales, and the features of each block are mapped through a linear embedding layer to obtain a large-scale block feature vector of the pth block at the tth moment
Figure FDA0003577556460000015
And small scale block feature vectors
Figure FDA0003577556460000016
D represents the dimension of the feature vector, L and S are the size of the block scale, L is gamma S, and gamma is more than 0 and is a scale multiple;
respectively splicing the two block feature vectors to obtain a large-scale space-time feature matrix
Figure FDA0003577556460000017
And small scale spatio-temporal feature matrix
Figure FDA0003577556460000018
[·,…,·]Representing a splicing operation; wherein the total number of large-scale space feature blocks
Figure FDA0003577556460000021
Total number of small-scale spatial feature blocks
Figure FDA0003577556460000022
Output dual scale spatiotemporal feature representation { Xl,Xs};
Step (3): construct a space-time block mutual attention encoder whose input is the dual-scale space-time feature representation and whose output is a dual-scale classification vector; the method comprises the following steps:
(3-1) the space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series, each consisting of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; its input is the dual-scale space-time feature representation {X_l, X_s};
(3-2) the r-th space-time block mutual attention module takes as input the dual-scale space-time feature tensor consisting of the input large-scale space-time feature matrix Z^{r,l}, the input small-scale space-time feature matrix Z^{r,s}, and the large-scale classification vector z^{r,l} and small-scale classification vector z^{r,s};
the r-th space-time block mutual attention module outputs the dual-scale mutual attention feature tensor, in which the output large-scale mutual attention feature matrix is the concatenation of the output large-scale classification vector and the output large-scale space-time feature matrix, and the output small-scale mutual attention feature matrix is likewise the concatenation of the output small-scale classification vector and the output small-scale space-time feature matrix;
when r = 1, the input large-scale and small-scale space-time feature matrices are the dual-scale space-time feature matrices X_l and X_s output by the dual-scale feature embedding module, and the large-scale classification vector z^{1,l} and the small-scale classification vector z^{1,s} are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module;
the output of the space-time block mutual attention encoder is the dual-scale classification vector contained in the output of the R-th space-time block mutual attention module;
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, whose height and width dimensions are determined by the regrouping; according to the block height h_r, block width w_r and block time length t_r, the large-scale feature map is partitioned into the r-th group of large-scale space-time block feature tensors {P_j^{r,l}}, where j is the index of a large-scale space-time block and Q_r, the total number of large-scale space-time blocks in the r-th group, is determined by the feature-map size and the block size h_r × w_r × t_r; the block size of the r-th group is λ times that of the (r-1)-th group, λ > 0, i.e. h_r = λ·h_{r-1}, w_r = λ·w_{r-1}, t_r = λ·t_{r-1} for r ≥ 2;
each large-scale space-time block is then dimension-transformed to obtain its space-time feature matrix, where the total number of spatial feature blocks of a large-scale space-time block is n_l = h_r·w_r;
the large-scale classification vector is spliced with this space-time feature matrix to obtain the updated j-th large-scale space-time block feature tensor element P_j^{r,l} of the r-th group; the updated small-scale space-time block feature tensor elements P_j^{r,s} are obtained by the same operation, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ²;
this yields the r-th group of dual-scale space-time block feature tensors {P_j^{r,l}} and {P_j^{r,s}};
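The following sketch shows one way to implement the space-time blocking of (3-3) in PyTorch, under simplifying assumptions: the feature map dimensions are divisible by the block size, and a single classification vector per scale is prepended to every block; the function name `spacetime_partition` is illustrative.

```python
import torch

def spacetime_partition(x: torch.Tensor, t_r: int, h_r: int, w_r: int, cls: torch.Tensor):
    """Partition a token grid into non-overlapping space-time blocks.

    x   : (B, T, H, W, D) feature map; T, H, W are assumed divisible by t_r, h_r, w_r.
    cls : (D,) classification vector prepended to every block.
    Returns a tensor of shape (B, Q_r, 1 + t_r*h_r*w_r, D), where
    Q_r = (T//t_r) * (H//h_r) * (W//w_r) is the number of blocks in the group.
    """
    B, T, H, W, D = x.shape
    assert T % t_r == 0 and H % h_r == 0 and W % w_r == 0
    # split each axis into (number of blocks, block size)
    x = x.reshape(B, T // t_r, t_r, H // h_r, h_r, W // w_r, w_r, D)
    # bring block indices together, then block-internal indices
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)
    q_r = (T // t_r) * (H // h_r) * (W // w_r)
    blocks = x.reshape(B, q_r, t_r * h_r * w_r, D)
    # prepend the classification vector to every block
    cls_tokens = cls.view(1, 1, 1, D).expand(B, q_r, 1, D)
    return torch.cat([cls_tokens, blocks], dim=2)

# example: 8 frames of a 14x14 large-scale token grid, blocks of size 2x7x7
tokens = torch.randn(1, 8, 14, 14, 128)
out = spacetime_partition(tokens, t_r=2, h_r=7, w_r=7, cls=torch.zeros(128))
print(out.shape)  # torch.Size([1, 16, 99, 128]) -> Q_r = 16 blocks of 1 + 98 tokens
```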
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule, i.e. the dual-scale space-time block feature tensors {P_j^{r,l}} and {P_j^{r,s}};
the j-th large-scale space-time block feature tensor element P_j^{r,l} of the r-th group is linearly mapped to obtain, at each attention head, its query matrix Q_a, key matrix K_a and value matrix V_a, where the attention head index is a = 1, …, A, A is the total number of attention heads, and d_A is the dimension of each vector in the mapping matrices; the corresponding multi-head space-time self-attention weight features are computed as Softmax(Q_a·K_a^T/√d_A)·V_a, where Softmax(·) is the normalized exponential function;
using the multi-head self-attention weight features, learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed and then decomposed into the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operation yields the small-scale space-time block space-time attention feature matrices, giving the r-th group of large-scale and small-scale space-time block space-time attention feature tensors;
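A minimal sketch of the space-time attention submodule of (3-4), assuming a standard pre-norm transformer layout; `nn.MultiheadAttention` performs the per-head Softmax(Q_a·K_a^T/√d_A)·V_a computation, and the class name and shapes are illustrative.

```python
import torch
import torch.nn as nn

class BlockSelfAttention(nn.Module):
    """Multi-head self-attention applied independently inside each space-time block."""

    def __init__(self, dim: int, heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, blocks: torch.Tensor) -> torch.Tensor:
        # blocks: (B, Q, 1 + n, D); fold the block axis into the batch so that
        # attention is computed separately within every space-time block
        B, Q, L, D = blocks.shape
        x = blocks.reshape(B * Q, L, D)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # Softmax(QK^T / sqrt(d_A)) V per head
        x = x + attn_out                        # residual connection
        x = x + self.mlp(self.norm2(x))         # residual MLP with layer normalization
        return x.reshape(B, Q, L, D)

# the first token of each block is the updated block classification vector,
# the remaining tokens form the block's space-time feature matrix
attn = BlockSelfAttention(dim=128, heads=8)
y = attn(torch.randn(1, 16, 99, 128))
cls_per_block, feat_per_block = y[:, :, 0], y[:, :, 1:]
```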
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule, comprising, for the j-th block of the r-th group, the dual-scale space-time block classification vectors and the dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain its query vector; the small-scale space-time block space-time attention feature matrix is linearly mapped twice to obtain its key matrix and value matrix; the multi-head space-time attention weight features are then computed as in (3-4), with the query taken from the large scale and the keys and values from the small scale;
using these weight features, learnable parameters and a residual structure, the updated large-scale space-time block classification vector is obtained, and thereby all large-scale space-time block classification vectors of the r-th group;
these are linearly mapped to obtain the updated large-scale classification vector z_cls^{r+1,l}; all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix Z^{r+1,l}, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operation yields the small-scale classification vector z_cls^{r+1,s} and the small-scale mutual attention feature matrix;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor formed by the large-scale and small-scale mutual attention feature matrices.
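The sketch below illustrates the large-scale branch of the scale mutual attention of (3-5): each large-scale block classification vector acts as the query, the tokens of a paired small-scale block supply keys and values. The one-to-one pairing of large- and small-scale blocks and the final mean-then-linear aggregation of the per-block CLS vectors are assumptions of this sketch, not details stated in the claims.

```python
import torch
import torch.nn as nn

class ScaleMutualAttention(nn.Module):
    """Cross-attention from large-scale block CLS vectors to small-scale block tokens."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)   # linear mapping applied to the aggregated CLS vectors

    def forward(self, cls_l: torch.Tensor, blocks_s: torch.Tensor) -> torch.Tensor:
        # cls_l   : (B, Q, D)          large-scale block classification vectors
        # blocks_s: (B, Q, n_s + 1, D) small-scale block space-time attention features
        B, Q, D = cls_l.shape
        q = self.norm_q(cls_l).reshape(B * Q, 1, D)
        kv = self.norm_kv(blocks_s).reshape(B * Q, -1, D)
        fused, _ = self.cross_attn(q, kv, kv)          # Softmax(qK^T / sqrt(d)) V
        cls_l = cls_l + fused.reshape(B, Q, D)         # residual update per block
        # aggregate the per-block CLS vectors and map them to the updated
        # large-scale classification vector of this module (aggregation is an assumption)
        return self.proj(cls_l.mean(dim=1))

mutual = ScaleMutualAttention(dim=128, heads=8)
z_cls_l = mutual(torch.randn(1, 16, 128), torch.randn(1, 16, 197, 128))  # (1, 128)
```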
Step (4) a classroom action classification module is constructed, input is a double-scale classification vector, and output is an action class probability vector;
step (5) performing iterative training on an action recognition model consisting of a double-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module until the model is converged;
and (6) preprocessing a new classroom student video, inputting a first frame of image into a pre-trained target detection model to obtain a student boundary frame, acquiring a corresponding video frame sequence according to the student boundary frame, inputting the video frame sequence into a trained action recognition model, and finally outputting the class of the action of the student.
2. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 1, wherein step (4) is specifically:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors z_cls^{R+1,l} and z_cls^{R+1,s} output by the dual-scale space-time block mutual attention encoder; a multilayer perceptron is used to compute, respectively, the large-scale score vector and the small-scale score vector of the action category to which the student action belongs;
(4-2) the action class probability vector y is output by fusing the large-scale and small-scale score vectors.
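A sketch of the classification module of (4-1)-(4-2) follows. Fusing the two score vectors by averaging them before the Softmax is an assumption of this sketch; the claims only state that the probability vector is produced from the two score vectors.

```python
import torch
import torch.nn as nn

class ClassroomActionHead(nn.Module):
    """Dual-scale CLS vectors -> score vectors -> fused action class probabilities."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.mlp_l = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))
        self.mlp_s = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, cls_l: torch.Tensor, cls_s: torch.Tensor) -> torch.Tensor:
        score_l = self.mlp_l(cls_l)                 # large-scale score vector
        score_s = self.mlp_s(cls_s)                 # small-scale score vector
        # average the two score vectors (an assumption), then normalise with Softmax
        return torch.softmax((score_l + score_s) / 2, dim=-1)

head = ClassroomActionHead(dim=128, num_classes=10)
probs = head(torch.randn(2, 128), torch.randn(2, 128))  # (2, 10), rows sum to 1
```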
3. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 2, wherein step (5) is specifically:
(5-1) the action recognition model consists of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model is the student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s, which are input into the dual-scale space-time block mutual attention encoder to output the dual-scale classification vectors z_cls^{R+1,l} and z_cls^{R+1,s}; the dual-scale classification vectors are input into the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) the action recognition model is trained iteratively until it converges: the loss function of the action recognition model is set to the cross-entropy loss L = -Σ_b ŷ_b·log y_b; the model is optimized with the stochastic gradient descent algorithm and the model parameters are updated by back-propagation of gradients until the loss converges; here y_b is the probability that the student action belongs to action class b, and ŷ_b is the true label: ŷ_b = 1 if the action category of the classroom student video is b, and ŷ_b = 0 otherwise.
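A minimal training sketch for (5-3): cross-entropy between the predicted probability vector and the one-hot ground-truth label, optimized with stochastic gradient descent. The function name, learning rate, momentum and epoch count are illustrative, not values from the patent.

```python
import torch
import torch.nn as nn

def train_action_model(model, loader, num_classes, epochs=50, lr=0.01, device="cpu"):
    """Cross-entropy loss L = -sum_b yhat_b * log(y_b) with SGD updates."""
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):
        running = 0.0
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            probs = model(clips)                              # action class probabilities y
            log_probs = torch.log(probs.clamp_min(1e-8))
            one_hot = nn.functional.one_hot(labels, num_classes).float()
            loss = -(one_hot * log_probs).sum(dim=1).mean()   # cross-entropy loss
            optimizer.zero_grad()
            loss.backward()                                   # back-propagate gradients
            optimizer.step()                                  # update model parameters
            running += loss.item()
        print(f"epoch {epoch}: mean loss {running / max(1, len(loader)):.4f}")
```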
4. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 3, wherein step (6) is specifically:
(6-1) a high-definition classroom student image dataset annotated with student position bounding boxes is input into a YOLOv5 target detection model pre-trained on the COCO2017 dataset, and the model is trained iteratively until convergence to obtain the target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1), and the first frame image is input into the target detection model to obtain the position bounding box of each student; the action video frame sequence V_φ of each student is then obtained as in (1-2), where φ is the student index, χ is the total number of students, and the i-th element of the φ-th student's frame sequence is an RGB three-channel image of height H and width W;
(6-3) the action video frame sequence V_φ of each student is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b' corresponding to the maximum probability value is taken as the class of the student's action, i.e. b' = argmax(y_φ), where argmax(·) returns the index of the largest element of a vector.
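An end-to-end inference sketch for step (6) follows. Loading YOLOv5 through the ultralytics/yolov5 torch.hub entry, filtering to the COCO "person" class, the clip length and the 224x224 input size are all assumptions of this sketch; the patent instead fine-tunes YOLOv5 on annotated classroom images before detection.

```python
import cv2
import torch

def recognise_classroom_actions(video_path, action_model, clip_len=16, device="cpu"):
    """Detect students in the first frame, crop a clip per student, classify each clip."""
    # generic COCO-pretrained detector as a stand-in for the fine-tuned model (assumption)
    detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < clip_len:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    if not frames:
        return []

    # student bounding boxes from the first frame: rows of [x1, y1, x2, y2, conf, cls]
    boxes = detector(frames[0]).xyxy[0].cpu().numpy()
    predictions = []
    for x1, y1, x2, y2, conf, cls_id in boxes:
        if int(cls_id) != 0:          # keep only the COCO "person" class
            continue
        crop_seq = [f[int(y1):int(y2), int(x1):int(x2)] for f in frames]
        # resize crops to the model input size (224 is an assumption) and stack into a clip
        clip = torch.stack([
            torch.from_numpy(cv2.resize(c, (224, 224))).permute(2, 0, 1).float() / 255.0
            for c in crop_seq
        ]).unsqueeze(0).to(device)            # (1, T, 3, 224, 224)
        probs = action_model(clip)            # action class probability vector y_phi
        predictions.append(int(probs.argmax(dim=-1)))  # b' = argmax(y_phi)
    return predictions
```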