CN116597503A - Classroom behavior detection method based on space-time characteristics - Google Patents
Classroom behavior detection method based on space-time characteristics
- Publication number
- CN116597503A (application number CN202310306774.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- space
- network
- frames
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of image processing and computer vision, and particularly relates to a classroom behavior detection method based on space-time characteristics, which comprises the following steps: selecting three key frames from a classroom video at an interval of K frames, converting them to grayscale, and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information; using the DarkNet-19 network as a feature extractor to obtain the features of the three-channel space-time image, and appending a fully connected layer to output a preliminary proposal region; stacking the longitudinal scan lines of the preliminary proposal region frame by frame to obtain a space-time map (STMap); initializing the STMap and passing it through a space-time feature extractor to obtain a motion information fluctuation feature map; and feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, followed by post-processing to obtain the detection result. The invention effectively reduces the computational cost of the network and improves the accuracy of fine-grained student behavior detection.
Description
Technical Field
The invention belongs to the technical field of image processing and computer vision, and relates to a classroom behavior detection method based on space-time characteristics.
Background
In recent years, with the continuous development of computer vision and artificial intelligence, China has been steadily advancing the construction of smart campuses featuring intelligent teaching, intelligent management and intelligent campus life. Classroom instruction is the most critical link in the smart campus, and its quality is determined by many factors, including instructional design, classroom practice and teaching evaluation, where teaching evaluation provides feedback on instructional design and practice.
In conventional teaching evaluation, an evaluating teacher usually sits at the back of the classroom and assesses both the teacher's instruction and the students' learning. However, because of the limited field of view, it is difficult to observe the specific in-class state of each student, so this approach is neither comprehensive nor objective. With the construction of smart campuses, most campuses are now equipped with cameras, and advanced computer vision technology can be used to automatically recognize student classroom behaviors and thus assist teaching evaluation.
Classroom behavior detection refers to automatically detecting and identifying the behaviors of students in a classroom using technologies such as computer vision and machine learning. Traditional video behavior detection algorithms often rely on proposal regions or key frames, which reduces algorithmic complexity to some extent, but they frequently fail in classroom videos affected by scale changes, occlusion and other difficulties.
Existing mainstream behavior detection methods are mainly divided into two-stream networks and networks based on three-dimensional convolution, and both have improved greatly in recent years. However, the optical flow used in a two-stream network only covers short-term temporal information, its long-term modeling of video is not ideal, and it cannot effectively distinguish student behavior categories with small inter-class differences. Three-dimensional convolution greatly improves temporal feature extraction, but it consumes a large amount of computational resources, produces large models and detects slowly, which makes it difficult to apply in classroom scenes. How to reduce network size so that behavior detection becomes practical in teaching scenes is therefore of significant research interest.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a classroom behavior detection method based on space-time characteristics, which comprises the following steps:
S1, selecting three key frames from the classroom video to be subjected to behavior detection, converting the key frames to grayscale, and splicing them as the R, G and B channels to obtain a three-channel space-time image containing motion information;
S2, taking the DarkNet-19 network as a feature extractor, extracting features of the three-channel space-time image at different scales through the repeated convolution and pooling operations of the DarkNet-19 network to remove irrelevant information, then compressing the extracted features into a one-dimensional vector, passing it to a fully connected layer, and obtaining a preliminary proposal region through a softmax function;
S3, stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain a space-time map STMap;
S4, initializing the space-time map STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the targets in the video.
The invention has the following beneficial effects: three-channel space-time images are adopted to fuse the temporal information of the classroom video into a two-dimensional image; a preliminary proposal region is obtained through the small feature extractor DarkNet-19 and a fully connected layer, which effectively reduces the amount of computation; generating the STMap allows the proposal-region information of the three-channel space-time image to be segmented effectively, improving the feature extraction of the network; and the space-time feature extractor captures the temporal information contained in the STMap, yielding the temporal features of student targets, so the method discriminates similar behaviors better.
Drawings
FIG. 1 is a flow chart of the classroom behavior detection method based on space-time characteristics according to the present invention;
FIG. 2 is a schematic diagram of three-channel space-time image generation according to the present invention;
FIG. 3 is a schematic diagram of STMap generation according to the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are apparently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
The invention provides a classroom behavior detection method based on space-time characteristics. FIG. 1 is a logic framework diagram of this embodiment; the method mainly comprises: selecting three key frames at an interval of K frames from a classroom video, converting them to grayscale and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information; using the small DarkNet-19 as the feature extractor to obtain the features of the three-channel space-time image and appending a fully connected layer to output a preliminary proposal region; stacking the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain an STMap; initializing the STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map; and feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
S1, selecting three key frames at an interval of K frames from a classroom video, converting the key frames to grayscale, and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information;
FIG. 2 is a schematic diagram of three-channel space-time image generation in this embodiment. As shown in FIG. 2, frames taken from the video at an interval of K frames (K = 0, 1, …, n) are set as key frames, and every three key frames form a group. Each key frame is converted to a single-channel grayscale image, and the three key frames of a group are taken, in temporal order, as the R, G and B channels and spliced into a three-channel space-time image containing motion information; this image contains a ghost-like composite formed by the motion information of the target across the three frames.
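The following sketch assumes an OpenCV/NumPy pipeline; the function name, frame-indexing scheme and interval handling are illustrative and not specified by the patent.

```python
# Minimal sketch (assumption: OpenCV + NumPy; indices and interval K are illustrative).
import cv2
import numpy as np

def build_spacetime_image(video_path: str, start: int, K: int) -> np.ndarray:
    """Read three key frames spaced K frames apart, grayscale them,
    and stack them as the R, G, B channels of one space-time image."""
    cap = cv2.VideoCapture(video_path)
    grays = []
    for i in range(3):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start + i * K)
        ok, frame = cap.read()
        if not ok:
            raise ValueError("could not read key frame")
        grays.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    # Channel order follows the temporal order of the key frames.
    return np.stack(grays, axis=-1)  # H x W x 3 space-time image
```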
S2, using the small DarkNet-19 as the feature extractor to obtain the features of the three-channel space-time image, and appending a fully connected layer to output a preliminary proposal region;
Feature extraction is performed by the small feature extractor DarkNet-19, which reduces the computational cost of the network. The obtained feature information is fed into the fully connected layer, and a softmax function yields the preliminary proposal region of the three-channel space-time image, thereby reducing the computational burden caused by the large number of pixels.
The preliminary proposal region is obtained by the softmax function:
$D = \dfrac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$
where $D$ denotes the preliminary proposal region obtained by the softmax function, $z_i$ denotes the one-dimensional vector into which the features of different scales are compressed, and $c$ denotes the number of feature dimensions.
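As a rough PyTorch sketch of this proposal head (the backbone stands in for DarkNet-19 and the layer sizes are assumptions, not the patent's exact configuration): the backbone features are flattened into a one-dimensional vector and passed through a fully connected layer followed by softmax.

```python
# Illustrative sketch only: "backbone" is a placeholder for DarkNet-19.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_regions: int):
        super().__init__()
        self.backbone = backbone          # e.g. a DarkNet-19 feature extractor
        self.fc = nn.Linear(feat_dim, num_regions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)          # repeated convolution + pooling
        z = torch.flatten(feats, 1)       # compress to a one-dimensional vector
        return torch.softmax(self.fc(z), dim=1)  # preliminary proposal scores D
```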
S3, stacking the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain an STMap;
FIG. 3 is a schematic diagram of STMap generation in this embodiment. As shown in FIG. 3, the longitudinal scan lines $(l_1, l_2, l_3)$ are stacked frame by frame, in the order of the three key frames in the R, G and B channels, into a two-dimensional matrix $S_{n \times 3}$, where $n$ is the number of pixels per scan line and 3 corresponds to the 3 key frames in the three channels; each scan line represents the motion state of the target in the current key frame.
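A minimal NumPy sketch of this stacking (the column index used to pick the scan line is an assumed illustration; the patent does not specify how scan-line positions are chosen):

```python
import numpy as np

def build_stmap(spacetime_image: np.ndarray, col: int) -> np.ndarray:
    """Stack one longitudinal scan line from each of the R, G, B key-frame
    channels into an n x 3 space-time map (STMap)."""
    # spacetime_image: H x W x 3, channels ordered by key-frame time
    l1 = spacetime_image[:, col, 0]   # scan line from key frame 1 (R)
    l2 = spacetime_image[:, col, 1]   # scan line from key frame 2 (G)
    l3 = spacetime_image[:, col, 2]   # scan line from key frame 3 (B)
    return np.stack([l1, l2, l3], axis=1)  # S with shape (n, 3)
```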
S4, initializing the STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S41, a linear time-dependent operator A is used to describe how the scan-line pixels in the STMap change over time:
$l_{x+1} = A l_x$
where $l_x$ is the state of the current scan line and $l_{x+1}$ is the state of the next scan line, so that the two are linearly related through A; the evolution of the STMap can therefore be expressed as $S_{i+1} = A S_i$.
S42, the STMap has a low-rank structure and its background pixels are highly correlated between adjacent columns, so the STMap can be represented as a combination of the eigenvectors and eigenvalues of the linear time-dependent operator A:
$S = \sum_i \varphi_i b_i \lambda_i$
where $\varphi_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of A, respectively, and $b_i$ are the coordinates of S in the corresponding eigenvector basis.
The matrix A is reconstructed with the DMD algorithm, and the parameters of the reconstruction are fed into an MLP network for training so that a low-order rank of A is obtained adaptively and fitted to the dynamic trajectory of the target in the original video sequence, i.e.
$\|S_{i+1} - A S_i\|_2 \rightarrow \min$
so that the STMap is decomposed into a low-rank background part and a sparse foreground part.
S43, the space-time feature extractor is based on the UNet model, in which the original encoder is replaced by a lightweight encoding module to reduce the semantic gap between encoder and decoder. The low-rank background part and the sparse foreground part of the STMap are fed into the improved UNet space-time feature extractor; multi-layer convolutions perform downsampling to extract features, correlation computation establishes the matching relations between the different features, the decoding module then performs several upsampling operations on the matched features to obtain the predicted optical flow for each feature, and the predicted optical flow is fused with the features of the corresponding encoding layers to obtain semantic information from different layers, yielding the motion information fluctuation feature map.
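A compressed PyTorch sketch of such an encoder-decoder with skip fusion; the channel counts, depth and the lightweight encoder blocks below are placeholders and not the patent's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Lightweight encoding block: two 3x3 convolutions with ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class MiniSpatioTemporalUNet(nn.Module):
    def __init__(self, c_in=2, c_out=2):   # e.g. background + foreground in, flow-like map out
        super().__init__()
        self.enc1, self.enc2 = conv_block(c_in, 16), conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = conv_block(32 + 16, 16)   # fuse upsampled features with the encoder skip
        self.head = nn.Conv2d(16, c_out, 1)   # predicted optical-flow-like output

    def forward(self, x):
        e1 = self.enc1(x)                     # downsampling path
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip fusion
        return self.head(d1)                  # motion information fluctuation feature map
```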
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
S51, inputting the obtained motion information fluctuation feature map into the convolutional neural network of the object detection network YOLOv5; a series of anchor boxes for locating and identifying targets is generated on the motion information fluctuation feature map, the confidence of each anchor box is computed, a threshold is set, and anchor boxes whose confidence is below the threshold are filtered out to obtain candidate boxes;
The threshold is set to 0.5, and any anchor box whose confidence of containing target information is lower than 0.5 is discarded.
The confidence is computed as the intersection over union of the predicted box and the ground-truth box:
$\mathrm{IOU} = \dfrac{\mathrm{area}(r_g \cap r_n)}{\mathrm{area}(r_g \cup r_n)}$
where IOU denotes the confidence, $\mathrm{area}(r_g)$ denotes the area of the predicted box and $\mathrm{area}(r_n)$ denotes the area of the ground-truth box.
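A small sketch of this IoU computation; the (x1, y1, x2, y2) box format is an assumption for illustration.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```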
S52: the candidate boxes are further screened by an NMS algorithm: the predicted bounding box with the highest confidence is selected from all candidate boxes as the reference and all other bounding boxes whose overlap (IoU) with it exceeds the preset threshold are removed; then the bounding box with the second-highest confidence among the remaining candidates is taken as the reference and all other boxes whose overlap with it exceeds the preset threshold are removed; this is repeated until every remaining predicted box has served as the reference, giving the final detection boxes. The behavior information in each detection box is then judged to obtain the position of the target, the behavior category and the behavior start time.
The NMS threshold can be adjusted for the actual scene; here it is set to 0.35 as a reference.
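A compact greedy NMS sketch, reusing the `iou` helper from the earlier sketch; the 0.35 threshold follows the text, everything else is an illustrative assumption.

```python
def nms(boxes, scores, iou_thresh: float = 0.35):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    boxes overlapping it beyond iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(ref)
        order = [i for i in order if iou(boxes[ref], boxes[i]) <= iou_thresh]
    return keep
```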
The convolutional neural network of the object detection network YOLOv5 is trained with a batch size of 16 for a total of 100 epochs, a learning rate of $10^{-3}$, a weight decay factor of 0.0005 and a momentum factor of 0.9.
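Under the assumption of a PyTorch training loop (the model below is a placeholder for the YOLOv5-based network), these hyperparameters would correspond to an SGD configuration such as:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)   # placeholder for the YOLOv5-based detection network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,              # learning rate 10^-3
    momentum=0.9,         # momentum factor 0.9
    weight_decay=0.0005,  # weight decay factor 0.0005
)
batch_size = 16           # training batch size
num_epochs = 100          # total number of epochs
```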
During network training, the spatio-temporal intersection over union between each link channel and each ground truth, i.e. the IoU of the region and start time at which the behavior occurs with the ground truth, is computed automatically. The matching rule for link channels is: for each ground truth in a segment, the link channel with the largest spatio-temporal IoU is found, matched to that ground truth and judged a positive sample; conversely, a link channel that matches no ground truth is matched to the background and judged a negative sample. Among the remaining unmatched link channels, any whose spatio-temporal IoU with some ground truth exceeds the threshold of 0.5 is also matched to that ground truth.
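A simplified matching sketch, assuming a spatio-temporal IoU function `st_iou` is supplied by the caller; the greedy assignment below is a plain reading of the rule, not the patent's exact implementation.

```python
def match_link_channels(link_channels, ground_truths, st_iou, thresh: float = 0.5):
    """Assign each link channel a label: index of the matched ground truth,
    or -1 for background (negative sample)."""
    labels = [-1] * len(link_channels)
    # Step 1: each ground truth claims the link channel with the largest spatio-temporal IoU.
    for g, gt in enumerate(ground_truths):
        best = max(range(len(link_channels)),
                   key=lambda i: st_iou(link_channels[i], gt), default=None)
        if best is not None:
            labels[best] = g
    # Step 2: remaining link channels with IoU above the threshold also become positives.
    for i, lc in enumerate(link_channels):
        if labels[i] == -1:
            for g, gt in enumerate(ground_truths):
                if st_iou(lc, gt) > thresh:
                    labels[i] = g
                    break
    return labels
```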
The loss function of the network comprises a regression loss and a classification loss:
$L(x, c, l, g) = \dfrac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$
where $N$ is the number of positive link-channel samples, $c$ is the predicted category confidence, $l$ is the predicted position, $g$ is the position parameter of the ground truth, $\alpha$ is a weight coefficient set to 1, $x$ denotes the output of the network, $L_{conf}(\cdot)$ denotes the classification loss and $L_{loc}(\cdot)$ denotes the regression loss.
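A schematic PyTorch rendering of this combined loss; the concrete criteria chosen below (cross-entropy for classification, smooth L1 for regression) are typical choices assumed for illustration and are not specified by the patent.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, loc_preds, loc_targets,
                   num_pos: int, alpha: float = 1.0) -> torch.Tensor:
    """L = (L_conf + alpha * L_loc) / N, averaged over the N positive samples."""
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")  # classification loss
    l_loc = F.smooth_l1_loss(loc_preds, loc_targets, reduction="sum")   # regression loss
    return (l_conf + alpha * l_loc) / max(num_pos, 1)
```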
In summary, three key frames are selected from the video at an interval of K frames and, after grayscale conversion, spliced as the R, G and B channels to form a three-channel space-time image containing motion information; the small DarkNet-19 is used as the feature extractor to obtain the features of the three-channel space-time image, and a fully connected layer is appended to output a preliminary proposal region; the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region are stacked frame by frame into a two-dimensional matrix to obtain an STMap; the STMap is initialized, decomposed with an improved DMD algorithm, and the decomposition result is input into the space-time feature extractor to obtain a motion information fluctuation feature map; and the motion information fluctuation feature map is fed into the object detection network YOLOv5 as the base network for detection, post-processing is performed, and the behavior detection results of the students in the video are output. The computational cost of the network is effectively reduced and the accuracy of fine-grained behavior detection is improved.
In the description of the present invention, it should be understood that terms such as "coaxial", "bottom", "one end", "top", "middle", "another end", "upper", "one side", "inner", "outer", "front", "center" and "two ends" indicate orientations or positional relationships based on those shown in the drawings, and are used only to facilitate and simplify the description of the invention; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the invention.
In the present invention, unless otherwise explicitly specified and limited, terms such as "mounted", "configured", "connected", "secured" and "rotated" are to be construed broadly; for example, a connection may be fixed, detachable or integral; mechanical or electrical; direct or through an intermediary; or an internal communication or interaction between two elements. Unless explicitly defined otherwise, the specific meaning of these terms in the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A classroom behavior detection method based on space-time characteristics, comprising the following steps:
S1, selecting three key frames from the classroom video to be subjected to behavior detection, converting the key frames to grayscale, and splicing them as the R, G and B channels to obtain a three-channel space-time image containing motion information;
S2, taking the DarkNet-19 network as a feature extractor, extracting features of the three-channel space-time image at different scales through the repeated convolution and pooling operations of the DarkNet-19 network to remove irrelevant information, then compressing the extracted features into a one-dimensional vector, passing it to a fully connected layer, and obtaining a preliminary proposal region through a softmax function;
S3, stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain a space-time map STMap;
S4, initializing the space-time map STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
2. The classroom behavior detection method based on space-time characteristics of claim 1, wherein the key frames comprise: frames of the video taken at an interval of K frames are set as key frames, where K = 0, 1, …, n.
3. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein the preliminary proposal region is obtained by the softmax function:
$D = \dfrac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$
where $D$ denotes the preliminary proposal region obtained by the softmax function, $z_i$ denotes the one-dimensional vector into which the features of different scales are compressed, and $c$ denotes the number of feature dimensions.
4. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain the STMap comprises:
stacking the longitudinal scan lines, each of which represents the motion state of the target in the current key frame, frame by frame according to the order of the three key frames in the R, G and B channels to form a two-dimensional matrix $S_{n \times 3}$, where $n$ denotes the number of pixels per scan line and 3 denotes the 3 key frames in the three channels.
5. The classroom behavior detection method based on space-time characteristics of claim 1, wherein the space-time feature extractor comprises:
taking the UNet model as the base space-time feature extractor and replacing the original encoder in the UNet model with a lightweight encoding module to obtain an improved UNet model, which is used as the final space-time feature extractor.
6. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein step S4 specifically comprises:
S41: using a linear time-dependent operator A to describe the change of the scan-line pixels in the STMap, thereby extracting temporal information and obtaining the temporal features of the target;
S42: according to the temporal features of the target, representing the STMap as a combination of the eigenvectors and eigenvalues of the linear time-dependent operator A, searching for a low-order rank of A with the improved DMD algorithm, fitting the low-order rank to the dynamic trajectory of the target in the original video sequence, and decomposing the STMap into a low-rank background part and a sparse foreground part;
S43: inputting the low-rank background part and the sparse foreground part of the STMap into the improved UNet space-time feature extractor, performing downsampling with multi-layer convolutions to extract features, using correlation computation to establish the matching relations between the different features, performing several upsampling operations with the decoding module on the matched features to obtain the predicted optical flow corresponding to each feature, and fusing the output predicted optical flow with the features of the corresponding encoding layers to obtain semantic information from different layers, thereby obtaining the motion information fluctuation feature map.
7. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein inputting the motion information fluctuation feature map into the network for detection and post-processing and outputting the behavior detection result of the target in the video comprises:
S51, inputting the obtained motion information fluctuation feature map into the convolutional neural network of the object detection network YOLOv5, generating on it a series of anchor boxes for locating and identifying targets, computing the confidence of each anchor box, setting a threshold, and filtering out anchor boxes whose confidence is below the threshold to obtain candidate boxes;
S52: further screening the candidate boxes by an NMS algorithm: selecting the predicted bounding box with the highest confidence among all candidate boxes as the reference and removing all other bounding boxes whose overlap (IoU) with it exceeds the preset threshold; then selecting the bounding box with the second-highest confidence among the remaining candidates as the reference and removing all other boxes whose overlap with it exceeds the preset threshold; repeating this operation until every remaining predicted box has served as the reference to obtain the final detection boxes; and judging the behavior information in the detection boxes to obtain the position of the target, the behavior category and the behavior start time.
8. The classroom behavior detection method based on space-time characteristics of claim 7, wherein the confidence is computed as:
$\mathrm{IOU} = \dfrac{\mathrm{area}(r_g \cap r_n)}{\mathrm{area}(r_g \cup r_n)}$
where IOU denotes the confidence, $\mathrm{area}(r_g)$ denotes the area of the predicted box and $\mathrm{area}(r_n)$ denotes the area of the ground-truth box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310306774.6A CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310306774.6A CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597503A (en) | 2023-08-15 |
Family
ID=87603295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310306774.6A Pending CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597503A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118155294A (en) * | 2024-05-11 | 2024-06-07 | 武汉纺织大学 | Double-flow network classroom behavior identification method based on space-time attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||