CN113408343A - Classroom action recognition method based on double-scale space-time block mutual attention - Google Patents
- Publication number
- CN113408343A (application CN202110518525.4A)
- Authority
- CN
- China
- Prior art keywords
- scale
- space
- time
- student
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/2415 — Pattern recognition; classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06Q50/205 — ICT specially adapted for education; education administration or guidance
Abstract
The invention discloses a classroom action recognition method based on dual-scale space-time block mutual attention. First, high-definition classroom student video data are preprocessed to obtain student action video frame sequences. An action recognition model is then constructed from a dual-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module; it successively produces a dual-scale space-time feature representation, dual-scale classification vectors and an action class probability vector, and is iteratively optimized with a stochastic gradient descent algorithm. Finally, a preprocessed new classroom video is input into the model to obtain the class of each student action. The method not only models multiple groups of space-time blocks with space-time attention to capture multi-scale space-time information of offline and online classroom student videos, but also depicts student picture information at different scales through a scale mutual attention mechanism, thereby improving the accuracy of student action recognition in classroom videos.
Description
Technical Field
The invention belongs to the technical field of video understanding and analysis, in particular to action recognition within video analysis, and relates to a classroom action recognition method based on dual-scale space-time block mutual attention.
Background
The traditional offline classroom is the main place where students study and teachers give lessons; in recent years, online classes, especially during the epidemic, have become a popular mode among teachers and students, generally delivered by live network broadcast or pre-recorded teaching. Whether the class is held offline in a classroom or online on a network platform, the quality of teaching directly influences the learning effect of students. A dilemma often encountered in practice is that, to ensure classroom teaching quality, teachers must spend much energy on classroom discipline management and cannot devote their full attention to teaching; this is particularly obvious in primary school classrooms. Video action recognition technology is therefore introduced to recognize the actions of students in a classroom, sense their learning state in real time, and provide an intelligent analysis report reflecting classroom quality. The classroom action recognition task takes a student action video frame sequence as input and outputs student action categories; it has wide application in scenarios such as classroom teaching, self-service class management and unmanned invigilation. For example, in an unmanned invigilation environment, the method can recognize examinee actions in real time, and an examinee showing a suspected cheating action can be investigated, thereby ensuring examination discipline. The main challenges are: it is difficult to unify offline and online classroom action recognition methods; students appear at different distances within the same video picture; and recognizing the actions of many students requires a large amount of computation.
Currently, practical applications of classroom-scene action recognition are few, and existing methods are mainly based on wearable devices or skeleton information. However, wearable devices may cause discomfort to students and thus affect their learning efficiency, while skeleton-based methods can recognize fewer action types and their recognition performance is easily degraded by occlusion from objects such as desks, chairs and books. In addition, traditional action recognition methods need to encode video frames into handcrafted features (such as HOG3D and 3D SURF), but handcrafted features have great limitations and slow extraction speed and cannot meet real-time requirements. In recent years, action recognition methods with a Convolutional Neural Network (CNN) at their core can learn, end to end, feature representations reflecting the latent semantic information of videos, greatly improving recognition accuracy. To extract more effective visual features, the residual network (ResNet) uses residual connections between different layers of the network, alleviating problems such as overfitting and vanishing or exploding gradients that arise when training deeper neural network models; the Non-Local Network captures long-range dependencies with non-local operations, establishing connections between pixel blocks at different distances in a video frame image through an attention mechanism and mining the semantic information among them.
In addition, the Transformer model, which originated in the natural language processing domain, has recently been favored in computer vision: its attention mechanism extracts the diverse key timing information in a video frame sequence, so that the model can learn more discriminative feature representations.
The existing classroom action recognition technology still has many defects. First, models are designed separately for offline or online classrooms, and a unified interface that fuses the two types of classroom action recognition is lacking. Second, computing space-time attention over blocks of all video frames during feature extraction ignores the local character of space-time features, which lowers the recognition rate, and the computational cost becomes excessive at high video resolutions. In addition, many methods extract only single-scale block space-time features and can hardly adapt to individual students appearing at different picture scales. To address the lack of a local space-time feature information exchange mechanism and the need to adapt to student pictures of different scales, an efficient classroom action recognition method that unifies offline and online classrooms and improves student action recognition accuracy is urgently needed.
Disclosure of Invention
The invention aims to provide, in view of the defects of the prior art, a classroom action recognition method based on dual-scale space-time block mutual attention, in which multiple groups of space-time blocks are modeled with space-time attention to capture multi-scale space-time information of offline and online classroom student videos, and scale mutual attention is used to depict student picture information at different scales, thereby improving the classroom action recognition rate.
The method first acquires high-definition classroom student video data and then performs the following operations in sequence:
Step (1): preprocess the high-definition classroom student video data to obtain student action video frame sequences;
Step (2): construct a dual-scale feature embedding module whose input is a student action video frame sequence and whose output is a dual-scale space-time feature representation;
Step (3): construct a space-time block mutual attention encoder whose input is the dual-scale space-time feature representation and whose output is a pair of dual-scale classification vectors;
Step (4): construct a classroom action classification module whose input is the dual-scale classification vectors and whose output is an action class probability vector;
Step (5): iteratively train the action recognition model consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module until the model converges;
Step (6): preprocess a new classroom student video, input its first frame image into a pre-trained target detection model to obtain student bounding boxes, acquire the corresponding video frame sequences according to the bounding boxes, input them into the trained action recognition model, and finally output the class of each student action.
Further, the step (1) is specifically:
(1-1) Process each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and label student position bounding boxes in the video frames at a time interval of 60k frames, obtaining a high-definition classroom student image data set; k = 15 to 30;
(1-2) For each student position bounding box, crop the 60k frame images inside the bounding-box region using the matrix indexing of OpenCV (open source computer vision library), and scale height and width to the same resolution, obtaining a student action video frame sequence V = (f_1, ..., f_T) in the real number domain, where f_i is the i-th RGB three-channel image of height H and width W in the frame sequence and T is the total frame number, i.e. T = 60k; the action class label is b, b = 1, ..., B, with B the total number of action classes.
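As a concrete illustration of the cropping and scaling in (1-2), the following is a minimal NumPy sketch; the bounding-box coordinates, frame sizes and the nearest-neighbor scaling (standing in for OpenCV's cv2.resize) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def crop_and_resize(frame, bbox, out_h, out_w):
    """Crop the bounding-box region from a frame and rescale it.

    frame: (H, W, 3) uint8 RGB image; bbox: (y1, x1, y2, x2) in pixel
    coordinates. Nearest-neighbor index selection stands in for cv2.resize.
    """
    y1, x1, y2, x2 = bbox
    region = frame[y1:y2, x1:x2]            # matrix-index crop, as with OpenCV
    h, w = region.shape[:2]
    rows = np.arange(out_h) * h // out_h    # nearest-neighbor source rows
    cols = np.arange(out_w) * w // out_w    # nearest-neighbor source columns
    return region[rows][:, cols]

# Build a short toy frame sequence for one student bounding box.
frames = [np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
          for _ in range(8)]
V = np.stack([crop_and_resize(f, (100, 200, 300, 400), 112, 112)
              for f in frames])
print(V.shape)  # (8, 112, 112, 3): T frames at a common resolution
```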
Still further, the step (2) is specifically:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) Input the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then pass them through the three-dimensional average pooling layer to obtain pooled space-time features F of size h x w x c x t, where h, w, c and t are respectively the height, width, channel and time-sequence dimensions of the pooled space-time features;
(2-3) Partition the pooled space-time features F along the height and width dimensions with feature blocking operations at the L x L and S x S scales respectively, and map the features of each block through a linear embedding layer to obtain, for the p-th block at time t, a large-scale block feature vector x_l^{t,p} in R^D and a small-scale block feature vector x_s^{t,p} in R^D, where D is the feature dimension, L and S are the block sizes, L = γS, and γ > 0 is the scale multiple;
Splice the two kinds of block feature vectors separately to obtain a large-scale space-time feature matrix X_l = [x_l^{1,1}, ..., x_l^{t,n_l}] and a small-scale space-time feature matrix X_s = [x_s^{1,1}, ..., x_s^{t,n_s}], where [·, ..., ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is n_l = hw / L^2, and the total number of small-scale spatial feature blocks is n_s = hw / S^2; output the dual-scale space-time feature representation {X_l, X_s}.
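The dual-scale blocking and linear embedding of (2-3), with the subsequent splicing of block vectors, can be sketched as follows in NumPy; all sizes (h, w, c, t, D, L, S) and the random projection matrices are toy assumptions.

```python
import numpy as np

def embed_blocks(F, block, W_embed):
    """Partition an (h, w, c) feature map into block x block patches and
    linearly embed each flattened patch to dimension D."""
    h, w, c = F.shape
    n = (h // block) * (w // block)
    patches = (F.reshape(h // block, block, w // block, block, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(n, block * block * c))
    return patches @ W_embed                # (n, D)

rng = np.random.default_rng(0)
h, w, c, t, D = 16, 16, 8, 4, 32
L, S = 4, 2                                 # L = gamma * S with gamma = 2
F = rng.normal(size=(t, h, w, c))           # pooled space-time features
Wl = rng.normal(size=(L * L * c, D))        # large-scale linear embedding
Ws = rng.normal(size=(S * S * c, D))        # small-scale linear embedding
X_l = np.stack([embed_blocks(F[i], L, Wl) for i in range(t)])  # (t, n_l, D)
X_s = np.stack([embed_blocks(F[i], S, Ws) for i in range(t)])  # (t, n_s, D)
print(X_l.shape, X_s.shape)  # n_l = hw/L^2 = 16, n_s = hw/S^2 = 64
```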
Further, the step (3) is specifically:
(3-1) The space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series; each module consists of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule. The input is the dual-scale space-time feature representation {X_l, X_s};
(3-2) The r-th space-time block mutual attention module E^r takes as input the dual-scale space-time feature tensor Z^r = {Z^{r,l}, Z^{r,s}, g^{r,l}, g^{r,s}}, where Z^{r,l} is the input large-scale space-time feature matrix, Z^{r,s} is the input small-scale space-time feature matrix, and g^{r,l} and g^{r,s} are the large-scale and small-scale classification vectors;
The module E^r outputs the dual-scale mutual attention feature tensor O^r, consisting of the output large-scale mutual attention feature matrix and small-scale mutual attention feature matrix, the output large-scale and small-scale classification vectors, and the output large-scale and small-scale space-time feature matrices;
When r = 1, the input large-scale and small-scale space-time feature matrices are taken from the dual-scale representation {X_l, X_s}, and the large-scale classification vector g^{1,l} and small-scale classification vector g^{1,s} are obtained by random initialization;
When R ≥ r > 1, the input dual-scale space-time feature tensor Z^r is the dual-scale mutual attention feature tensor output by the previous module E^{r-1}, i.e. Z^r = O^{r-1};
The output of the space-time block mutual attention encoder is the pair of dual-scale classification vectors output by the R-th space-time block mutual attention module E^R.
(3-3) The space-time block generation submodule of the r-th dual-scale space-time block mutual attention module E^r regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map and a small-scale feature map of uniform size, with height dimension h_r and width dimension w_r;
According to the height dimension h_r, width dimension w_r and time dimension t_r, the large-scale feature map is partitioned into the r-th group of large-scale space-time block feature tensors, where j is the index of a large-scale space-time block and Q_r is the total number of space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r-1)-th group, λ > 0, for r ≥ 2;
Each large-scale space-time block feature tensor is then dimension-transformed into the space-time feature matrix of that large-scale space-time block, whose total number of spatial feature blocks is n_l' = h_r w_r;
The large-scale classification vector is spliced with this matrix to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
The same operation yields the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s' = h_r w_r γ^2;
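The space-time block generation of (3-3) amounts to partitioning a feature map into non-overlapping space-time blocks, each yielding its own token matrix; the following NumPy sketch uses illustrative block sizes, not the patent's parameters.

```python
import numpy as np

def make_st_blocks(F, bh, bw, bt):
    """Split a (t, h, w, D) feature tensor into non-overlapping
    space-time blocks of size (bt, bh, bw), one token matrix per block."""
    t, h, w, D = F.shape
    blocks = (F.reshape(t // bt, bt, h // bh, bh, w // bw, bw, D)
                .transpose(0, 2, 4, 1, 3, 5, 6)
                .reshape(-1, bt, bh * bw, D))  # (Q, bt, spatial tokens, D)
    return blocks

rng = np.random.default_rng(1)
F = rng.normal(size=(4, 8, 8, 32))             # toy large-scale feature map
blocks = make_st_blocks(F, bh=4, bw=4, bt=2)   # Q = (4/2)*(8/4)*(8/4) = 8
print(blocks.shape)  # (8, 2, 16, 32)
```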
(3-4) The input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module E^r is the output of the space-time block generation submodule. The j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain, at each attention head, a query matrix, a key matrix and a value matrix, where the attention head index is a = 1, ..., A, A is the total number of attention heads, and each vector of the mapping matrices has dimension d = D / A; the corresponding multi-head space-time self-attention weight features are computed as Softmax(Q_a K_a^T / sqrt(d)) V_a, where Softmax(·) is the normalized exponential function;
Using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
This matrix is decomposed, via MLP(·), the multilayer perceptron, and LN(·), layer normalization, into an updated large-scale space-time block classification vector and a large-scale space-time block space-time feature matrix;
The same operation yields the small-scale space-time block space-time attention feature matrix;
(3-5) The input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module E^r is the output of the space-time attention submodule: the j-th dual-scale space-time block classification vectors of the r-th group and the dual-scale space-time block space-time feature matrices;
The large-scale space-time block classification vector is linearly mapped to a query vector; the large-scale space-time block classification vector spliced with the small-scale space-time block space-time feature matrix is linearly mapped to a key matrix and a value matrix; the multi-head space-time mutual attention weight features are then computed as in (3-4);
Using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is obtained;
All large-scale space-time block classification vectors of the r-th group are thus obtained and linearly mapped to the updated large-scale classification vector;
All large-scale space-time block space-time feature matrices of the r-th group are spliced into the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to form the large-scale mutual attention feature matrix;
The same operation yields the small-scale classification vector and the small-scale mutual attention feature matrix;
The output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor O^r.
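The scale mutual attention of (3-5), in which the large-scale classification vector queries the small-scale tokens, can be sketched as a single-head cross-attention in NumPy; all names and sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scale_cross_attention(cls_l, X_s, Wq, Wk, Wv):
    """The large-scale classification vector attends over the small-scale
    space-time feature matrix: query from cls_l, keys/values from X_s."""
    q = cls_l @ Wq                          # (D,) query vector
    K = X_s @ Wk                            # (n_s, D) keys
    V = X_s @ Wv                            # (n_s, D) values
    w = softmax(q @ K.T / np.sqrt(len(q)))  # attention over small-scale tokens
    return cls_l + w @ V                    # residual update of cls_l

rng = np.random.default_rng(3)
D, n_s = 32, 64
cls_l = rng.normal(size=D)                  # large-scale classification vector
X_s = rng.normal(size=(n_s, D))             # small-scale token matrix
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
updated = scale_cross_attention(cls_l, X_s, Wq, Wk, Wv)
print(updated.shape)  # (32,)
```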
Still further, the step (4) is specifically:
(4-1) The input of the classroom action classification module is the pair of dual-scale classification vectors output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons compute, respectively, a large-scale score vector and a small-scale score vector over the action classes to which a student action may belong.
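A hedged sketch of the classification module in (4-1): two multilayer perceptrons score the dual-scale classification vectors, and the scores are fused by summation before normalization (the fusion rule is an assumption, as the patent text does not spell it out; all sizes are toy values).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron mapping a classification vector to B scores."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(4)
D, H, B = 32, 64, 5                         # B action categories
cls_l, cls_s = rng.normal(size=D), rng.normal(size=D)
params = [rng.normal(size=s) * 0.1 for s in ((D, H), H, (H, B), B)]
# Fuse the large- and small-scale score vectors, then normalize.
y = softmax(mlp(cls_l, *params) + mlp(cls_s, *params))
print(y.shape)  # (5,): action class probability vector summing to 1
```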
Still further, the step (5) is specifically:
(5-1) The action recognition model consists of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) The input of the action recognition model is a student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X_l and X_s; these are input into the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors; the classification vectors are input into the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) Iteratively train the action recognition model until it converges: the loss function is the cross-entropy loss Loss = -Σ_{b=1}^{B} ŷ_b log y_b, where y_b is the probability that the student action belongs to action class b and ŷ_b is the true label: ŷ_b = 1 if the action class of the classroom student video is b, and ŷ_b = 0 otherwise; the model is optimized with a stochastic gradient descent algorithm, and the model parameters are updated by backward gradient propagation until the loss converges.
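The cross-entropy loss and stochastic-gradient update of (5-3) can be illustrated on a toy linear scorer standing in for the full model; the learning rate and dimensions are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, b_true):
    """Cross-entropy loss for one sample: the one-hot true label selects
    a single -log-probability term from the sum over classes."""
    return -np.log(y[b_true])

# One stochastic-gradient step on a linear scorer (toy stand-in for the model).
rng = np.random.default_rng(5)
D, B, lr = 32, 5, 0.1
W = rng.normal(size=(D, B)) * 0.1
x, b_true = rng.normal(size=D), 2
y = softmax(x @ W)
loss_before = cross_entropy(y, b_true)
grad = np.outer(x, y - np.eye(B)[b_true])   # dLoss/dW for softmax + CE
W -= lr * grad                              # gradient-descent parameter update
loss_after = cross_entropy(softmax(x @ W), b_true)
print(loss_after < loss_before)  # True: the step reduces the loss
```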
Still further, the step (6) is specifically:
(6-1) Input the high-definition classroom student image data set labeled with student position bounding boxes into a target detection model (YOLOv5 pre-trained on the COCO2017 data set) and iteratively train it until convergence, obtaining the trained target detection model;
(6-2) For a new classroom student video, obtain the video frame sequence as in (1-1) and input the first frame image into the trained target detection model to obtain the position bounding box of each student; then obtain each student's action video frame sequence V_φ as in (1-2), where φ is the student index, χ is the total number of students, and the i-th element of V_φ is an RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) Input each student's action video frame sequence V_φ into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student, and take the action class b' corresponding to the maximum probability value as the class of that student's action: b' = argmax(y_φ), where argmax(·) returns the index of the largest element of the vector.
The method of the invention uses a dual-scale space-time block mutual attention encoder to recognize student actions in student videos, and has the following characteristics: 1) unlike existing methods designed only for offline or only for online classes, it first uses a target detection model to obtain each student's action frame sequence and then recognizes each student's action class, so it applies equally to offline and online classroom scenarios; 2) unlike existing methods that compute space-time attention over all video frame blocks at every feature extraction step, it extracts space-time features within multiple groups of space-time blocks via the space-time block generation submodule and the space-time attention submodule, realizing local space-time feature information exchange and greatly reducing computational overhead; 3) it partitions video frames at two different sizes and combines them through the scale mutual attention submodule, so as to better extract the action information of individual student pictures of different scales in the video.
The method is suitable for action recognition in complex classroom scenes with many participating students whose individual picture scales differ, and has the following advantages: 1) it unifies the action recognition methods of offline and online classrooms, reducing the technical cost of applying action recognition to both kinds of classes; 2) it extracts features from multiple different space-time regions through the space-time block generation submodule and the space-time attention submodule, fully considering the local character of space-time features to obtain more accurate recognition and better computational efficiency; 3) the scale mutual attention submodule learns from individual student pictures at different scales and fully fuses the space-time features of the two block scales to obtain better recognition performance. The invention has the capability of local space-time feature learning and of capturing the spatial characteristics of student pictures at different scales, and can improve the student action recognition rate in practical application scenarios such as classroom teaching supervision, self-service class management and unmanned invigilation.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A classroom action recognition method based on dual-scale space-time block mutual attention first samples classroom student videos to obtain video frame sequences, obtains a bounding box of each student's position with a target detection model, and crops the frame images inside each bounding box to obtain student action video frame sequences. It then constructs an action recognition model consisting of a dual-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module, and finally judges the class of each student's action with this model. The method uses the target detection model to obtain student action frame sequences so that recognition applies equally to offline and online classes; it uses the space-time block generation submodule and the space-time attention submodule to extract space-time features within multiple groups of space-time blocks, realizing local space-time feature information exchange; and it uses two block scales with the scale mutual attention submodule to capture action information at different scales, adapting to students appearing at different picture scales. A classroom action recognition system constructed in this way can be deployed uniformly in both kinds of classes, while effectively extracting the spatio-temporal information of student action video frames and efficiently recognizing student action classes.
As shown in fig. 1, the method first obtains high definition classroom student video data, and then performs the following operations:
preprocessing high-definition classroom student video data to obtain a student action video frame sequence; the method comprises the following steps:
(1-1) Process each online or offline high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k = 25 frames per second, and label student position bounding boxes in the video frames at a time interval of 60k = 1500 frames, obtaining a high-definition classroom student image data set;
(1-2) For each student position bounding box, crop the 60k frame images inside the bounding-box region using the matrix indexing of OpenCV (open source computer vision library), and scale height and width to the same resolution, obtaining a student action video frame sequence V = (f_1, ..., f_T) in the real number domain, where f_i is the i-th RGB three-channel image of height H and width W in the frame sequence and T is the total frame number, i.e. T = 1500; the action class label is b, b = 1, ..., B, with B the total number of action classes.
Constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation; the method comprises the following steps:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feeding the space-time features into the three-dimensional average pooling layer to obtain the pooled space-time feature X_p ∈ R^{h×w×c×t}, where h, w, c and t are respectively the height, width, channel and time-sequence dimensions of the pooled space-time feature;
(2-3) applying the feature blocking operation to the height and width dimensions of the pooled space-time feature X_p at the L×L and S×S scales respectively, and mapping the features of each block through the linear embedding layer to obtain the large-scale block feature vector x_{t,p}^l ∈ R^D and the small-scale block feature vector x_{t,p}^s ∈ R^D of the p-th block at time t, where D is the dimension of the feature vector, L and S are the block scale sizes, and L = γS with scale multiple γ > 0;
splicing the two kinds of block feature vectors respectively to obtain the large-scale space-time feature matrix X^l = [x_{1,1}^l, …, x_{t,n_l}^l] and the small-scale space-time feature matrix X^s = [x_{1,1}^s, …, x_{t,n_s}^s], where [·, …, ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is n_l = hw/L², and the total number of small-scale spatial feature blocks is n_s = hw/S² = γ²·n_l; the output is the dual-scale space-time feature representation {X^l, X^s}.
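The dual-scale blocking and linear embedding of (2-3) can be sketched as follows, assuming a pooled feature of shape (t, h, w, c) and random projection matrices as stand-ins for the learned linear embedding layer:

```python
import numpy as np

def patch_embed(feat, P, W_proj):
    # feat: (t, h, w, c) pooled spatio-temporal feature; P: block size
    t, h, w, c = feat.shape
    n = (h // P) * (w // P)
    # split height/width into non-overlapping P x P blocks, flatten each
    blocks = feat.reshape(t, h // P, P, w // P, P, c)
    blocks = blocks.transpose(0, 1, 3, 2, 4, 5).reshape(t, n, P * P * c)
    return blocks @ W_proj        # linear embedding to dimension D

rng = np.random.default_rng(0)
t, h, w, c, D = 2, 8, 8, 4, 16
feat = rng.standard_normal((t, h, w, c))
S, gamma = 2, 2
L = gamma * S                     # L = γ·S
Xl = patch_embed(feat, L, rng.standard_normal((L * L * c, D)))
Xs = patch_embed(feat, S, rng.standard_normal((S * S * c, D)))
print(Xl.shape, Xs.shape)         # (2, 4, 16) (2, 16, 16)
```

With h = w = 8, S = 2 and γ = 2 this yields n_l = hw/L² = 4 large-scale blocks and n_s = hw/S² = 16 small-scale blocks per time step, matching n_s = γ²·n_l.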
Step (3) constructing a space-time block mutual attention encoder, whose input is the dual-scale space-time feature representation and whose output is a dual-scale classification vector; the method comprises the following steps:
(3-1) the space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series, each consisting of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; the input is the dual-scale space-time feature representation {X^l, X^s};
(3-2) the r-th space-time block mutual attention module takes as input the dual-scale space-time feature tensor {Z^{r,l}, Z^{r,s}, z_cls^{r,l}, z_cls^{r,s}}, where Z^{r,l} is the input large-scale space-time feature matrix, Z^{r,s} is the input small-scale space-time feature matrix, and z_cls^{r,l} and z_cls^{r,s} are the large-scale and small-scale classification vectors;
the r-th space-time block mutual attention module outputs the dual-scale mutual attention feature tensor {Y^{r,l}, Y^{r,s}}, where the output large-scale mutual attention feature matrix is Y^{r,l} = [y_cls^{r,l}, Z′^{r,l}] and the output small-scale mutual attention feature matrix is Y^{r,s} = [y_cls^{r,s}, Z′^{r,s}]; y_cls^{r,l} and y_cls^{r,s} are the output large-scale and small-scale classification vectors, and Z′^{r,l} and Z′^{r,s} are the output large-scale and small-scale space-time feature matrices;
when r = 1, the input large-scale space-time feature matrix is Z^{1,l} = X^l, the input small-scale space-time feature matrix is Z^{1,s} = X^s, and the large-scale classification vector z_cls^{1,l} and small-scale classification vector z_cls^{1,s} are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module, i.e., {Z^{r,l}, Z^{r,s}, z_cls^{r,l}, z_cls^{r,s}} = {Z′^{r-1,l}, Z′^{r-1,s}, y_cls^{r-1,l}, y_cls^{r-1,s}};
the output of the space-time block mutual attention encoder is the dual-scale classification vector pair y_cls^{R,l} and y_cls^{R,s} output by the R-th space-time block mutual attention module.
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map F^{r,l} and a small-scale feature map F^{r,s} of uniform size, with height dimension h_r and width dimension w_r;
according to the height dimension h_r, width dimension w_r and time dimension t_r, F^{r,l} is partitioned into space-time blocks to obtain the r-th group of large-scale space-time block feature tensors, where j is the index subscript of a large-scale space-time block and Q_r is the total number of large-scale space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r−1)-th group, λ > 0, for r ≥ 2;
each large-scale space-time block then undergoes a dimension transformation to obtain the space-time feature matrix of the large-scale space-time block, where the total number of spatial feature blocks within a large-scale space-time block is n_l = h_r·w_r;
the space-time feature matrix of the j-th block is spliced with the large-scale classification vector to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
the same operation yields the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ².
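The space-time block partitioning of (3-3) is a pure reshape of the feature map into non-overlapping t_r × h_r × w_r cells. A sketch under assumed toy dimensions (the helper name is hypothetical):

```python
import numpy as np

def spacetime_blocks(F, bt, bh, bw):
    # F: (t, h, w, D) feature map; returns (Q, bt*bh*bw, D) block tokens,
    # where Q = (t//bt) * (h//bh) * (w//bw) is the number of blocks
    t, h, w, D = F.shape
    G = F.reshape(t // bt, bt, h // bh, bh, w // bw, bw, D)
    G = G.transpose(0, 2, 4, 1, 3, 5, 6)   # group blocks, then tokens inside
    return G.reshape(-1, bt * bh * bw, D)

F = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3).astype(float)
blocks = spacetime_blocks(F, bt=1, bh=2, bw=2)
print(blocks.shape)   # (8, 4, 3): Q = 2*2*2 = 8 blocks of 4 tokens each
```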
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule; the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain, at each attention head a, the query matrix q^(a), key matrix k^(a) and value matrix v^(a), where the attention head index a = 1, …, A, A is the total number of attention heads, and the dimension of each vector in the mapping matrices is D/A; the corresponding multi-head space-time self-attention weight feature is computed as α^(a) = Softmax(q^(a)·(k^(a))^T / √(D/A))·v^(a), where Softmax(·) is the normalized exponential function;
using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operation yields the small-scale space-time block space-time attention feature matrix.
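The per-block multi-head self-attention of (3-4), including the Softmax-scaled dot product and the residual connection, can be sketched as follows (the weight matrices are random stand-ins for the learned linear mappings; the MLP/LN refinement described in the text is omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(Z, Wq, Wk, Wv, Wo, A):
    # Z: (n, D) tokens of one space-time block; A heads of width D // A
    n, D = Z.shape
    d = D // A
    split = lambda M: (Z @ M).reshape(n, A, d).transpose(1, 0, 2)
    q, k, v = split(Wq), split(Wk), split(Wv)
    # scaled dot-product attention per head: Softmax(q k^T / sqrt(d)) v
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d)) @ v
    out = att.transpose(1, 0, 2).reshape(n, D) @ Wo
    return Z + out                 # residual connection

rng = np.random.default_rng(1)
n, D, A = 5, 8, 2
Z = rng.standard_normal((n, D))
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))
Y = mhsa(Z, Wq, Wk, Wv, Wo, A)
print(Y.shape)   # (5, 8)
```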
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule: the j-th dual-scale space-time block classification vectors of the r-th group and the corresponding dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector together with the small-scale space-time block space-time feature matrix is linearly mapped to obtain the key matrix and value matrix; the multi-head space-time attention weight features are then computed as in (3-4);
using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is computed;
all large-scale space-time block classification vectors of the r-th group are thereby obtained and linearly mapped to yield the updated large-scale classification vector;
all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operation yields the small-scale classification vector and the small-scale mutual attention feature matrix;
the output of the r-th space-time block mutual attention module is the dual-scale mutual attention feature tensor.
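The scale mutual attention of (3-5) lets the large-scale classification vector query the small-scale tokens, in the spirit of cross-scale class-token attention; a single-head sketch with hypothetical weight matrices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_scale_cls(cls_l, Z_s, Wq, Wk, Wv):
    # cls_l: (D,) large-scale classification vector (query side)
    # Z_s:   (n_s, D) small-scale token matrix (key/value side)
    q = cls_l @ Wq
    K, V = Z_s @ Wk, Z_s @ Wv
    w = softmax(q @ K.T / np.sqrt(q.shape[0]))  # attention over small-scale tokens
    return cls_l + w @ V                        # residual update of the class token

rng = np.random.default_rng(2)
D = 8
cls_l = rng.standard_normal(D)
Z_s = rng.standard_normal((6, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.2 for _ in range(3))
print(cross_scale_cls(cls_l, Z_s, Wq, Wk, Wv).shape)   # (8,)
```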
Step (4) constructing a classroom action classification module, whose input is the dual-scale classification vector and whose output is an action class probability vector; the method comprises the following steps:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors y_cls^{R,l} and y_cls^{R,s} output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector and the small-scale score vector of the action category to which the student action belongs.
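A sketch of the classification head of (4-1); the source does not state how the two score vectors are fused, so averaging them before the Softmax is an assumption, and the shared MLP weights are random stand-ins:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(cls_l, cls_s, W1, W2):
    # per-scale MLP score vectors over B action classes;
    # fusion by averaging is an assumption, not stated in the source
    score_l = np.tanh(cls_l @ W1) @ W2
    score_s = np.tanh(cls_s @ W1) @ W2
    return softmax((score_l + score_s) / 2)

rng = np.random.default_rng(3)
D, B = 8, 5
p = classify(rng.standard_normal(D), rng.standard_normal(D),
             rng.standard_normal((D, D)), rng.standard_normal((D, B)))
print(p.shape)   # (5,), a probability vector summing to 1
```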
Step (5) iteratively training the action recognition model, consisting of the dual-scale feature embedding module, the space-time block mutual attention encoder and the classroom action classification module, until the model converges; the method comprises the following steps:
(5-1) the action recognition model is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model is the student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s; the dual-scale space-time feature matrices are input to the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors y_cls^{R,l} and y_cls^{R,s}; the dual-scale classification vectors are input to the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) iteratively training the action recognition model until the model converges: the loss function of the action recognition model is set to the cross-entropy loss L = −Σ_{b=1}^{B} ŷ_b·log y_b; the action recognition model is optimized with the stochastic gradient descent algorithm, and the model parameters are updated through gradient backpropagation until the loss converges; here y_b is the predicted probability that the student action belongs to action class b, and ŷ_b is the real mark: ŷ_b = 1 if the action category of the classroom student video is b, otherwise ŷ_b = 0.
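The cross-entropy loss and gradient-descent update of (5-3) can be illustrated on a toy softmax classifier (a stand-in for the full model, whose gradients would come from backpropagation through all modules):

```python
import numpy as np

def cross_entropy(y_prob, b_true):
    # L = -sum_b yhat_b * log(y_b), with one-hot real mark yhat
    return float(-np.log(y_prob[b_true] + 1e-12))

# toy softmax classifier on one fixed feature vector, trained by
# gradient descent to predict class 2
rng = np.random.default_rng(4)
x = rng.standard_normal(8)
W = np.zeros((8, 4))
lr = 0.1
for _ in range(200):
    logits = x @ W
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad = np.outer(x, p - np.eye(4)[2])   # dL/dW for softmax + cross-entropy
    W -= lr * grad                          # gradient-descent parameter update
logits = x @ W
p = np.exp(logits - logits.max()); p /= p.sum()
print(int(p.argmax()))   # 2
```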
Step (6) preprocessing a new classroom student video: the first frame image is input into the pre-trained target detection model to obtain the student bounding boxes, the corresponding video frame sequences are acquired according to the student bounding boxes and input into the trained action recognition model, and finally the class of each student action is output; the method comprises the following steps:
(6-1) the high-definition classroom student image data set labeled with student position bounding boxes is input into the open-source target detection model YOLOv5 pre-trained on the existing COCO2017 data set, and the model is iteratively trained until convergence, yielding the trained target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1); the first frame image is input into the trained target detection model to obtain each student's position bounding box, and each student's action video frame sequence V_φ is obtained as in (1-2), where φ is the serial number of the student, χ is the total number of students, and f_i^φ ∈ R^{H×W×3} denotes the i-th RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) each student's action video frame sequence V_φ is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b′ corresponding to the maximum probability value is taken as the category to which the student action belongs, b′ = argmax(y_φ), where argmax(·) is the index of the largest element in the vector.
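The final decision rule b′ = argmax(y_φ) of (6-3) is a plain argmax over the per-student probability vectors; the class names below are hypothetical examples:

```python
import numpy as np

def recognize(prob_vectors, class_names):
    # b' = argmax(y_phi): pick the most probable action class per student
    return [class_names[int(np.argmax(y))] for y in prob_vectors]

# hypothetical action classes and per-student probability vectors
classes = ["listen", "write", "raise_hand", "sleep"]
probs = np.array([[0.10, 0.70, 0.10, 0.10],
                  [0.05, 0.05, 0.80, 0.10]])
print(recognize(probs, classes))   # ['write', 'raise_hand']
```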
The embodiment described herein is only an example of an implementation form of the inventive concept; the protection scope of the present invention should not be regarded as limited to the specific form set forth in the embodiment, but also covers equivalent technical means conceivable by those skilled in the art according to the inventive concept.
Claims (7)
1. The classroom action identification method based on double-scale space-time block mutual attention is characterized in that the method firstly obtains high-definition classroom student video data and then carries out the following operations:
preprocessing high-definition classroom student video data to obtain a student action video frame sequence;
constructing a dual-scale feature embedding module, inputting a student action video frame sequence, and outputting a dual-scale space-time feature representation;
constructing a space-time block mutual attention encoder, inputting a dual-scale space-time feature representation, and outputting a dual-scale classification vector;
step (4) a classroom action classification module is constructed, input is a double-scale classification vector, and output is an action class probability vector;
step (5) performing iterative training on an action recognition model consisting of a double-scale feature embedding module, a space-time block mutual attention encoder and a classroom action classification module until the model is converged;
and (6) preprocessing a new classroom student video, inputting a first frame of image into a pre-trained target detection model to obtain a student boundary frame, acquiring a corresponding video frame sequence according to the student boundary frame, inputting the video frame sequence into a trained action recognition model, and finally outputting the class of the action of the student.
2. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 1, wherein the step (1) is specifically as follows:
(1-1) processing each high-definition classroom student video into a corresponding video frame sequence at a sampling rate of k frames per second, and labeling student position bounding boxes in the high-definition classroom student video frames at a time interval of 60k frames to obtain a high-definition classroom student image data set, wherein k is 15-30;
(1-2) for each student position bounding box, intercepting the 60k frames of images in the bounding box area using the matrix indexing method of OpenCV (open source computer vision library), and scaling the height and width to the same resolution to obtain a student action video frame sequence V = {f_1, f_2, …, f_T} in the real number domain, with action category label b, b = 1, …, B, where B is the total number of action categories, f_i ∈ R^{H×W×3} is the i-th RGB three-channel image of height H and width W in the frame sequence, and T is the total frame number, i.e., T = 60k.
3. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 2, wherein the step (2) is specifically as follows:
(2-1) the dual-scale feature embedding module consists of a three-dimensional convolution layer, a three-dimensional average pooling layer, a feature blocking operation and a linear embedding layer;
(2-2) inputting the student action video frame sequence V into the three-dimensional convolution layer to obtain space-time features, and then feeding the space-time features into the three-dimensional average pooling layer to obtain the pooled space-time feature X_p ∈ R^{h×w×c×t}, where h, w, c and t are respectively the height, width, channel and time-sequence dimensions of the pooled space-time feature;
(2-3) applying the feature blocking operation to the height and width dimensions of the pooled space-time feature X_p at the L×L and S×S scales respectively, and mapping the features of each block through the linear embedding layer to obtain the large-scale block feature vector x_{t,p}^l ∈ R^D and the small-scale block feature vector x_{t,p}^s ∈ R^D of the p-th block at time t, where D is the dimension of the feature vector, L and S are the block scale sizes, and L = γS with scale multiple γ > 0;
splicing the two kinds of block feature vectors respectively to obtain the large-scale space-time feature matrix X^l = [x_{1,1}^l, …, x_{t,n_l}^l] and the small-scale space-time feature matrix X^s = [x_{1,1}^s, …, x_{t,n_s}^s], where [·, …, ·] denotes the splicing operation, the total number of large-scale spatial feature blocks is n_l = hw/L², and the total number of small-scale spatial feature blocks is n_s = hw/S² = γ²·n_l; the output is the dual-scale space-time feature representation {X^l, X^s}.
4. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 3, wherein the step (3) is specifically as follows:
(3-1) the space-time block mutual attention encoder is formed by connecting R space-time block mutual attention modules in series, each consisting of a space-time block generation submodule, a space-time attention submodule and a scale mutual attention submodule; the input is the dual-scale space-time feature representation {X^l, X^s};
(3-2) the r-th space-time block mutual attention module takes as input the dual-scale space-time feature tensor {Z^{r,l}, Z^{r,s}, z_cls^{r,l}, z_cls^{r,s}}, where Z^{r,l} is the input large-scale space-time feature matrix, Z^{r,s} is the input small-scale space-time feature matrix, and z_cls^{r,l} and z_cls^{r,s} are the large-scale and small-scale classification vectors;
the r-th space-time block mutual attention module outputs the dual-scale mutual attention feature tensor {Y^{r,l}, Y^{r,s}}, where the output large-scale mutual attention feature matrix is Y^{r,l} = [y_cls^{r,l}, Z′^{r,l}] and the output small-scale mutual attention feature matrix is Y^{r,s} = [y_cls^{r,s}, Z′^{r,s}]; y_cls^{r,l} and y_cls^{r,s} are the output large-scale and small-scale classification vectors, and Z′^{r,l} and Z′^{r,s} are the output large-scale and small-scale space-time feature matrices;
when r = 1, the input large-scale space-time feature matrix is Z^{1,l} = X^l, the input small-scale space-time feature matrix is Z^{1,s} = X^s, and the large-scale classification vector z_cls^{1,l} and small-scale classification vector z_cls^{1,s} are obtained by random initialization;
when R ≥ r > 1, the input dual-scale space-time feature tensor is the dual-scale mutual attention feature tensor output by the previous space-time block mutual attention module, i.e., {Z^{r,l}, Z^{r,s}, z_cls^{r,l}, z_cls^{r,s}} = {Z′^{r-1,l}, Z′^{r-1,s}, y_cls^{r-1,l}, y_cls^{r-1,s}};
the output of the space-time block mutual attention encoder is the dual-scale classification vector pair y_cls^{R,l} and y_cls^{R,s} output by the R-th space-time block mutual attention module.
(3-3) the space-time block generation submodule of the r-th dual-scale space-time block mutual attention module regroups the input Z^{r,l} and Z^{r,s} into a large-scale feature map F^{r,l} and a small-scale feature map F^{r,s} of uniform size, with height dimension h_r and width dimension w_r;
according to the height dimension h_r, width dimension w_r and time dimension t_r, F^{r,l} is partitioned into space-time blocks to obtain the r-th group of large-scale space-time block feature tensors, where j is the index subscript of a large-scale space-time block and Q_r is the total number of large-scale space-time blocks in the r-th group; the size of the r-th group of space-time blocks is λ times that of the (r−1)-th group, λ > 0, for r ≥ 2;
each large-scale space-time block then undergoes a dimension transformation to obtain the space-time feature matrix of the large-scale space-time block, where the total number of spatial feature blocks within a large-scale space-time block is n_l = h_r·w_r;
the space-time feature matrix of the j-th block is spliced with the large-scale classification vector to obtain the updated j-th large-scale space-time block feature tensor element of the r-th group;
the same operation yields the updated small-scale space-time block feature tensor elements, where the total number of spatial feature blocks of a small-scale space-time block is n_s = h_r·w_r·γ².
(3-4) the input of the space-time attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time block generation submodule; the j-th large-scale space-time block feature tensor element of the r-th group is linearly mapped to obtain, at each attention head a, the query matrix q^(a), key matrix k^(a) and value matrix v^(a), where the attention head index a = 1, …, A, A is the total number of attention heads, and the dimension of each vector in the mapping matrices is D/A; the corresponding multi-head space-time self-attention weight feature is computed as α^(a) = Softmax(q^(a)·(k^(a))^T / √(D/A))·v^(a), where Softmax(·) is the normalized exponential function;
using learnable parameters and a residual structure, the large-scale space-time block space-time attention feature matrix is computed;
this matrix is decomposed to obtain the updated large-scale space-time block classification vector and the large-scale space-time block space-time feature matrix, where MLP(·) denotes a multilayer perceptron and LN(·) denotes layer normalization;
the same operation yields the small-scale space-time block space-time attention feature matrix.
(3-5) the input of the scale mutual attention submodule of the r-th dual-scale space-time block mutual attention module is the output of the space-time attention submodule: the j-th dual-scale space-time block classification vectors of the r-th group and the corresponding dual-scale space-time block space-time feature matrices;
the large-scale space-time block classification vector is linearly mapped to obtain the query vector; the large-scale space-time block classification vector together with the small-scale space-time block space-time feature matrix is linearly mapped to obtain the key matrix and value matrix; the multi-head space-time attention weight features are then computed as in (3-4);
using learnable parameters and a residual structure, the updated large-scale space-time block classification vector is computed;
all large-scale space-time block classification vectors of the r-th group are thereby obtained and linearly mapped to yield the updated large-scale classification vector;
all large-scale space-time block space-time feature matrices of the r-th group are spliced to obtain the large-scale space-time feature matrix, which is spliced with the large-scale classification vector to obtain the large-scale mutual attention feature matrix;
the same operation yields the small-scale classification vector and the small-scale mutual attention feature matrix.
5. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 4, wherein the step (4) is specifically as follows:
(4-1) the input of the classroom action classification module is the dual-scale classification vectors y_cls^{R,l} and y_cls^{R,s} output by the dual-scale space-time block mutual attention encoder; multilayer perceptrons are used to compute, respectively, the large-scale score vector and the small-scale score vector of the action category to which the student action belongs.
6. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 5, wherein the step (5) is specifically as follows:
(5-1) the action recognition model is composed of the dual-scale feature embedding module of step (2), the dual-scale space-time block mutual attention encoder of step (3) and the action classification module of step (4);
(5-2) the input of the action recognition model is the student action video frame sequence V; the dual-scale feature embedding module computes and outputs the dual-scale space-time feature matrices X^l and X^s; the dual-scale space-time feature matrices are input to the dual-scale space-time block mutual attention encoder, which outputs the dual-scale classification vectors y_cls^{R,l} and y_cls^{R,s}; the dual-scale classification vectors are input to the action classification module, which outputs the probability vector of the action class to which the student action belongs;
(5-3) iteratively training the action recognition model until the model converges: the loss function of the action recognition model is set to the cross-entropy loss L = −Σ_{b=1}^{B} ŷ_b·log y_b; the action recognition model is optimized with the stochastic gradient descent algorithm, and the model parameters are updated through gradient backpropagation until the loss converges; here y_b is the predicted probability that the student action belongs to action class b, and ŷ_b is the real mark: ŷ_b = 1 if the action category of the classroom student video is b, otherwise ŷ_b = 0.
7. The classroom action recognition method based on double-scale space-time block mutual attention as claimed in claim 6, wherein step (6) is specifically:
(6-1) the high-definition classroom student image data set labeled with student position bounding boxes is input into the target detection model YOLOv5 pre-trained on the COCO2017 data set, and the model is iteratively trained until convergence, yielding the trained target detection model;
(6-2) for a new classroom student video, the video frame sequence is obtained as in (1-1); the first frame image is input into the trained target detection model to obtain each student's position bounding box, and each student's action video frame sequence V_φ is obtained as in (1-2), where φ is the serial number of the student, χ is the total number of students, and f_i^φ ∈ R^{H×W×3} denotes the i-th RGB three-channel image of height H and width W in the φ-th student's frame sequence;
(6-3) each student's action video frame sequence V_φ is input into the action recognition model trained in step (5) to obtain the action class probability vector y_φ of the φ-th student; the action category b′ corresponding to the maximum probability value is taken as the category to which the student action belongs, b′ = argmax(y_φ), where argmax(·) is the index of the largest element in the vector.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110518525.4A CN113408343B (en) | 2021-05-12 | 2021-05-12 | Classroom action recognition method based on double-scale space-time block mutual attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113408343A true CN113408343A (en) | 2021-09-17 |
CN113408343B CN113408343B (en) | 2022-05-13 |
Family
ID=77678584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110518525.4A Active CN113408343B (en) | 2021-05-12 | 2021-05-12 | Classroom action recognition method based on double-scale space-time block mutual attention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113408343B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10089556B1 (en) * | 2017-06-12 | 2018-10-02 | Konica Minolta Laboratory U.S.A., Inc. | Self-attention deep neural network for action recognition in surveillance videos |
CN109389055A (en) * | 2018-09-21 | 2019-02-26 | 西安电子科技大学 | Video classification methods based on mixing convolution sum attention mechanism |
CN109902293A (en) * | 2019-01-30 | 2019-06-18 | 华南理工大学 | A kind of file classification method based on part with global mutually attention mechanism |
CN111027377A (en) * | 2019-10-30 | 2020-04-17 | 杭州电子科技大学 | Double-flow neural network time sequence action positioning method |
CN111611847A (en) * | 2020-04-01 | 2020-09-01 | 杭州电子科技大学 | Video motion detection method based on scale attention hole convolution network |
CN112183269A (en) * | 2020-09-18 | 2021-01-05 | 哈尔滨工业大学(深圳) | Target detection method and system suitable for intelligent video monitoring |
Non-Patent Citations (2)
Title |
---|
WANG Jieran: "Research on Video Action Recognition Methods Based on High-and-Low-Level Feature Fusion and Convolutional Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology |
TIAN Shuai: "Research on Sequence Data Classification Models Based on Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887610A (en) * | 2021-09-29 | 2022-01-04 | 内蒙古工业大学 | Pollen image classification method based on cross attention distillation transducer |
CN113887610B (en) * | 2021-09-29 | 2024-02-02 | 内蒙古工业大学 | Pollen image classification method based on cross-attention distillation transducer |
CN114373224A (en) * | 2021-12-28 | 2022-04-19 | 华南理工大学 | Fuzzy 3D skeleton action identification method and device based on self-supervision learning |
CN114648722A (en) * | 2022-04-07 | 2022-06-21 | 杭州电子科技大学 | Action identification method based on video multipath space-time characteristic network |
CN114648722B (en) * | 2022-04-07 | 2023-07-18 | 杭州电子科技大学 | Motion recognition method based on video multipath space-time characteristic network |
CN115273182A (en) * | 2022-07-13 | 2022-11-01 | 苏州工业职业技术学院 | Long video concentration degree prediction method and device |
CN115273182B (en) * | 2022-07-13 | 2023-07-11 | 苏州工业职业技术学院 | Long video concentration prediction method and device |
CN117292209A (en) * | 2023-11-27 | 2023-12-26 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
CN117292209B (en) * | 2023-11-27 | 2024-04-05 | 之江实验室 | Video classification method and device based on space-time enhanced three-dimensional attention re-parameterization |
CN118230259A (en) * | 2024-05-24 | 2024-06-21 | 辽宁人人畅享科技有限公司 | Practice teaching management system based on internet of things technology |
CN118230259B (en) * | 2024-05-24 | 2024-07-16 | 辽宁人人畅享科技有限公司 | Practice teaching management system based on internet of things technology |
Also Published As
Publication number | Publication date |
---|---|
CN113408343B (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113408343B (en) | Classroom action recognition method based on double-scale space-time block mutual attention | |
CN111709409B (en) | Face living body detection method, device, equipment and medium | |
US11810366B1 (en) | Joint modeling method and apparatus for enhancing local features of pedestrians | |
EP3968179A1 (en) | Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device | |
Deng et al. | MVF-Net: A multi-view fusion network for event-based object classification | |
CN112036447B (en) | Zero-sample target detection system and learnable semantic and fixed semantic fusion method | |
CN115240121B (en) | Joint modeling method and device for enhancing local features of pedestrians | |
CN112001278A (en) | Crowd counting model based on structured knowledge distillation and method thereof | |
CN113033276B (en) | Behavior recognition method based on conversion module | |
CN111507275B (en) | Video data time sequence information extraction method and device based on deep learning | |
CN111738355A (en) | Image classification method and device with attention fused with mutual information and storage medium | |
CN112507920A (en) | Examination abnormal behavior identification method based on time displacement and attention mechanism | |
CN111368733B (en) | Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal | |
CN114780767A (en) | Large-scale image retrieval method and system based on deep convolutional neural network | |
Zhang et al. | Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention | |
CN114187506B (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN113536926A (en) | Human body action recognition method based on distance vector and multi-angle self-adaptive network | |
Zhang et al. | Skeleton-based action recognition with attention and temporal graph convolutional network | |
CN112861848B (en) | Visual relation detection method and system based on known action conditions | |
CN115496991A (en) | Reference expression understanding method based on multi-scale cross-modal feature fusion | |
CN109726690B (en) | Multi-region description method for learner behavior image based on DenseCap network | |
CN117557857B (en) | Detection network light weight method combining progressive guided distillation and structural reconstruction | |
CN117333799B (en) | Middle and primary school classroom behavior detection method and device based on deformable anchor frame | |
CN117372837A (en) | Cross-modal knowledge migration method based on knowledge distillation and unsupervised training modes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||