CN116597503A - Classroom behavior detection method based on space-time characteristics - Google Patents
Classroom behavior detection method based on space-time characteristics
- Publication number
- CN116597503A (application number CN202310306774.6A)
- Authority
- CN
- China
- Prior art keywords
- time
- space
- network
- frames
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of image processing and computer vision, and particularly relates to a classroom behavior detection method based on space-time characteristics, which comprises the following steps: selecting three key frames from a classroom video at an interval of K frames, converting them to grayscale, and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information; using the DarkNet-19 network as a feature extractor to obtain the features of the three-channel space-time image, and appending a fully connected layer to output a preliminary proposal region; stacking the longitudinal scan lines of the preliminary proposal region frame by frame to obtain a space-time map (STMap); initializing the STMap and passing it through a space-time feature extractor to obtain a motion information fluctuation feature map; and feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, followed by post-processing to obtain the detection result. The invention effectively reduces the computational cost of the network and improves the accuracy of fine-grained student behavior detection.
Description
Technical Field
The invention belongs to the technical field of image processing and computer vision, and relates to a classroom behavior detection method based on space-time characteristics.
Background
In recent years, with the continuous development of computer vision and artificial intelligence, China has been steadily advancing the construction of smart campuses featuring intelligent teaching, intelligent management and intelligent campus life. Classroom instruction is the most critical link in the smart campus, and its quality is determined by many factors, including instructional design, classroom practice and teaching evaluation, where teaching evaluation provides feedback on instructional design and practice.
In conventional teaching evaluation, an evaluating teacher usually sits at the back of the classroom and assesses both the teacher's instruction and the students' learning. However, because of the limited field of view, it is difficult to observe the specific in-class state of each student, so this approach is neither comprehensive nor objective. With the construction of smart campuses, most campuses are now equipped with cameras, and advanced computer vision technology can be used to automatically recognize student classroom behaviors and thus assist teaching evaluation.
Classroom behavior detection refers to automatically detecting and identifying the behaviors of students in a classroom using technologies such as computer vision and machine learning. Traditional video behavior detection algorithms often rely on proposal regions or key frames, which reduces algorithmic complexity to some extent, but they frequently fail in classroom videos affected by scale changes, occlusion and other difficulties.
Existing mainstream behavior detection methods are mainly divided into two-stream networks and networks based on three-dimensional convolution, and both have improved greatly in recent years. However, the optical flow used in a two-stream network only covers short-term temporal information, its long-term modeling of video is not ideal, and it cannot effectively distinguish student behavior categories with small inter-class differences. Three-dimensional convolution greatly improves temporal feature extraction, but it consumes a large amount of computational resources, produces large models and detects slowly, which makes it difficult to apply in classroom scenes. How to reduce network size so that behavior detection becomes practical in teaching scenes is therefore of significant research interest.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a classroom behavior detection method based on space-time characteristics, which comprises the following steps:
S1, selecting three key frames from the classroom video to be subjected to behavior detection, converting the key frames to grayscale, and splicing them as the R, G and B channels to obtain a three-channel space-time image containing motion information;
S2, taking the DarkNet-19 network as a feature extractor, extracting features of the three-channel space-time image at different scales through the repeated convolution and pooling operations of the DarkNet-19 network to remove irrelevant information, then compressing the extracted features into a one-dimensional vector, passing it to a fully connected layer, and obtaining a preliminary proposal region through a softmax function;
S3, stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain a space-time map STMap;
S4, initializing the space-time map STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the targets in the video.
The invention has the following beneficial effects: three-channel space-time images are adopted to fuse the temporal information of the classroom video into a two-dimensional image; a preliminary proposal region is obtained through the small feature extractor DarkNet-19 and a fully connected layer, which effectively reduces the amount of computation; generating the STMap allows the proposal-region information of the three-channel space-time image to be segmented effectively, improving the feature extraction of the network; and the space-time feature extractor captures the temporal information contained in the STMap, yielding the temporal features of student targets, so the method discriminates similar behaviors better.
Drawings
FIG. 1 is a flow chart of the classroom behavior detection method based on space-time characteristics according to the present invention;
FIG. 2 is a schematic diagram of three-channel space-time image generation according to the present invention;
FIG. 3 is a schematic diagram of STMap generation according to the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are apparently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
The invention provides a classroom behavior detection method based on space-time characteristics. FIG. 1 is a logic framework diagram of this embodiment; the method mainly comprises: selecting three key frames at an interval of K frames from a classroom video, converting them to grayscale and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information; using the small DarkNet-19 as the feature extractor to obtain the features of the three-channel space-time image and appending a fully connected layer to output a preliminary proposal region; stacking the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain an STMap; initializing the STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map; and feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
S1, selecting three key frames at an interval of K frames from a classroom video, converting the key frames to grayscale, and splicing them as the R, G and B channels to form a three-channel space-time image containing motion information;
FIG. 2 is a schematic diagram of three-channel space-time image generation in this embodiment. As shown in FIG. 2, frames taken from the video at an interval of K frames (K = 0, 1, …, n) are set as key frames, and every three key frames form a group. Each key frame is converted to a single-channel grayscale image, and the three key frames of a group are taken, in temporal order, as the R, G and B channels and spliced into a three-channel space-time image containing motion information; this image contains a ghost-like composite formed by the motion information of the target across the three frames.
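The following sketch assumes an OpenCV/NumPy pipeline; the function name, frame-indexing scheme and interval handling are illustrative and not specified by the patent.

```python
# Minimal sketch (assumption: OpenCV + NumPy; indices and interval K are illustrative).
import cv2
import numpy as np

def build_spacetime_image(video_path: str, start: int, K: int) -> np.ndarray:
    """Read three key frames spaced K frames apart, grayscale them,
    and stack them as the R, G, B channels of one space-time image."""
    cap = cv2.VideoCapture(video_path)
    grays = []
    for i in range(3):
        cap.set(cv2.CAP_PROP_POS_FRAMES, start + i * K)
        ok, frame = cap.read()
        if not ok:
            raise ValueError("could not read key frame")
        grays.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    # Channel order follows the temporal order of the key frames.
    return np.stack(grays, axis=-1)  # H x W x 3 space-time image
```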
S2, using the small DarkNet-19 as the feature extractor to obtain the features of the three-channel space-time image, and appending a fully connected layer to output a preliminary proposal region;
Feature extraction is performed by the small feature extractor DarkNet-19, which reduces the computational cost of the network. The obtained feature information is fed into the fully connected layer, and a softmax function yields the preliminary proposal region of the three-channel space-time image, thereby reducing the computational burden caused by the large number of pixels.
The preliminary proposal region is obtained by the softmax function:
$D = \dfrac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$
where $D$ denotes the preliminary proposal region obtained by the softmax function, $z_i$ denotes the one-dimensional vector into which the features of different scales are compressed, and $c$ denotes the number of feature dimensions.
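As a rough PyTorch sketch of this proposal head (the backbone stands in for DarkNet-19 and the layer sizes are assumptions, not the patent's exact configuration): the backbone features are flattened into a one-dimensional vector and passed through a fully connected layer followed by softmax.

```python
# Illustrative sketch only: "backbone" is a placeholder for DarkNet-19.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_regions: int):
        super().__init__()
        self.backbone = backbone          # e.g. a DarkNet-19 feature extractor
        self.fc = nn.Linear(feat_dim, num_regions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)          # repeated convolution + pooling
        z = torch.flatten(feats, 1)       # compress to a one-dimensional vector
        return torch.softmax(self.fc(z), dim=1)  # preliminary proposal scores D
```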
S3, stacking the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain an STMap;
FIG. 3 is a schematic diagram of STMap generation in this embodiment. As shown in FIG. 3, the longitudinal scan lines $(l_1, l_2, l_3)$ are stacked frame by frame, in the order of the three key frames in the R, G and B channels, into a two-dimensional matrix $S_{n \times 3}$, where $n$ is the number of pixels per scan line and 3 corresponds to the 3 key frames in the three channels; each scan line represents the motion state of the target in the current key frame.
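A minimal NumPy sketch of this stacking (the column index used to pick the scan line is an assumed illustration; the patent does not specify how scan-line positions are chosen):

```python
import numpy as np

def build_stmap(spacetime_image: np.ndarray, col: int) -> np.ndarray:
    """Stack one longitudinal scan line from each of the R, G, B key-frame
    channels into an n x 3 space-time map (STMap)."""
    # spacetime_image: H x W x 3, channels ordered by key-frame time
    l1 = spacetime_image[:, col, 0]   # scan line from key frame 1 (R)
    l2 = spacetime_image[:, col, 1]   # scan line from key frame 2 (G)
    l3 = spacetime_image[:, col, 2]   # scan line from key frame 3 (B)
    return np.stack([l1, l2, l3], axis=1)  # S with shape (n, 3)
```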
S4, initializing the STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S41, a linear time-dependent operator A is used to describe how the scan-line pixels in the STMap change over time:
$l_{x+1} = A l_x$
where $l_x$ is the state of the current scan line and $l_{x+1}$ is the state of the next scan line, so that the two are linearly related through A; the evolution of the STMap can therefore be expressed as $S_{i+1} = A S_i$.
S42, the STMap has a low-rank structure and its background pixels are highly correlated between adjacent columns, so the STMap can be represented as a combination of the eigenvectors and eigenvalues of the linear time-dependent operator A:
$S = \sum_i \varphi_i b_i \lambda_i$
where $\varphi_i$ and $\lambda_i$ are the eigenvectors and eigenvalues of A, respectively, and $b_i$ are the coordinates of S in the corresponding eigenvector basis.
The matrix A is reconstructed with the DMD algorithm, and the parameters of the reconstruction are fed into an MLP network for training so that a low-order rank of A is obtained adaptively and fitted to the dynamic trajectory of the target in the original video sequence, i.e.
$\|S_{i+1} - A S_i\|_2 \rightarrow \min$
so that the STMap is decomposed into a low-rank background part and a sparse foreground part.
S43, the space-time feature extractor is based on the UNet model, in which the original encoder is replaced by a lightweight encoding module to reduce the semantic gap between encoder and decoder. The low-rank background part and the sparse foreground part of the STMap are fed into the improved UNet space-time feature extractor; multi-layer convolutions perform downsampling to extract features, correlation computation establishes the matching relations between the different features, the decoding module then performs several upsampling operations on the matched features to obtain the predicted optical flow for each feature, and the predicted optical flow is fused with the features of the corresponding encoding layers to obtain semantic information from different layers, yielding the motion information fluctuation feature map.
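A compressed PyTorch sketch of such an encoder-decoder with skip fusion; the channel counts, depth and the lightweight encoder blocks below are placeholders and not the patent's exact architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Lightweight encoding block: two 3x3 convolutions with ReLU.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class MiniSpatioTemporalUNet(nn.Module):
    def __init__(self, c_in=2, c_out=2):   # e.g. background + foreground in, flow-like map out
        super().__init__()
        self.enc1, self.enc2 = conv_block(c_in, 16), conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = conv_block(32 + 16, 16)   # fuse upsampled features with the encoder skip
        self.head = nn.Conv2d(16, c_out, 1)   # predicted optical-flow-like output

    def forward(self, x):
        e1 = self.enc1(x)                     # downsampling path
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip fusion
        return self.head(d1)                  # motion information fluctuation feature map
```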
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
S51, inputting the obtained motion information fluctuation feature map into the convolutional neural network of the object detection network YOLOv5; a series of anchor boxes for locating and identifying targets is generated on the motion information fluctuation feature map, the confidence of each anchor box is computed, a threshold is set, and anchor boxes whose confidence is below the threshold are filtered out to obtain candidate boxes;
The threshold is set to 0.5, and any anchor box whose confidence of containing target information is lower than 0.5 is discarded.
The confidence is computed as the intersection over union of the predicted box and the ground-truth box:
$\mathrm{IOU} = \dfrac{\mathrm{area}(r_g \cap r_n)}{\mathrm{area}(r_g \cup r_n)}$
where IOU denotes the confidence, $\mathrm{area}(r_g)$ denotes the area of the predicted box and $\mathrm{area}(r_n)$ denotes the area of the ground-truth box.
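A small sketch of this IoU computation; the (x1, y1, x2, y2) box format is an assumption for illustration.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```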
S52: the candidate boxes are further screened by an NMS algorithm: the predicted bounding box with the highest confidence is selected from all candidate boxes as the reference and all other bounding boxes whose overlap (IoU) with it exceeds the preset threshold are removed; then the bounding box with the second-highest confidence among the remaining candidates is taken as the reference and all other boxes whose overlap with it exceeds the preset threshold are removed; this is repeated until every remaining predicted box has served as the reference, giving the final detection boxes. The behavior information in each detection box is then judged to obtain the position of the target, the behavior category and the behavior start time.
The NMS threshold can be adjusted for the actual scene; here it is set to 0.35 as a reference.
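A compact greedy NMS sketch, reusing the `iou` helper from the earlier sketch; the 0.35 threshold follows the text, everything else is an illustrative assumption.

```python
def nms(boxes, scores, iou_thresh: float = 0.35):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    boxes overlapping it beyond iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(ref)
        order = [i for i in order if iou(boxes[ref], boxes[i]) <= iou_thresh]
    return keep
```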
The convolutional neural network of the object detection network YOLOv5 is trained with a batch size of 16 for a total of 100 epochs, a learning rate of $10^{-3}$, a weight decay factor of 0.0005 and a momentum factor of 0.9.
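Under the assumption of a PyTorch training loop (the model below is a placeholder for the YOLOv5-based network), these hyperparameters would correspond to an SGD configuration such as:

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)   # placeholder for the YOLOv5-based detection network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,              # learning rate 10^-3
    momentum=0.9,         # momentum factor 0.9
    weight_decay=0.0005,  # weight decay factor 0.0005
)
batch_size = 16           # training batch size
num_epochs = 100          # total number of epochs
```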
During network training, the spatio-temporal intersection over union between each link channel and each ground truth, i.e. the IoU of the region and start time at which the behavior occurs with the ground truth, is computed automatically. The matching rule for link channels is: for each ground truth in a segment, the link channel with the largest spatio-temporal IoU is found, matched to that ground truth and judged a positive sample; conversely, a link channel that matches no ground truth is matched to the background and judged a negative sample. Among the remaining unmatched link channels, any whose spatio-temporal IoU with some ground truth exceeds the threshold of 0.5 is also matched to that ground truth.
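A simplified matching sketch, assuming a spatio-temporal IoU function `st_iou` is supplied by the caller; the greedy assignment below is a plain reading of the rule, not the patent's exact implementation.

```python
def match_link_channels(link_channels, ground_truths, st_iou, thresh: float = 0.5):
    """Assign each link channel a label: index of the matched ground truth,
    or -1 for background (negative sample)."""
    labels = [-1] * len(link_channels)
    # Step 1: each ground truth claims the link channel with the largest spatio-temporal IoU.
    for g, gt in enumerate(ground_truths):
        best = max(range(len(link_channels)),
                   key=lambda i: st_iou(link_channels[i], gt), default=None)
        if best is not None:
            labels[best] = g
    # Step 2: remaining link channels with IoU above the threshold also become positives.
    for i, lc in enumerate(link_channels):
        if labels[i] == -1:
            for g, gt in enumerate(ground_truths):
                if st_iou(lc, gt) > thresh:
                    labels[i] = g
                    break
    return labels
```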
The loss function of the network comprises a regression loss and a classification loss:
$L(x, c, l, g) = \dfrac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$
where $N$ is the number of positive link-channel samples, $c$ is the predicted category confidence, $l$ is the predicted position, $g$ is the position parameter of the ground truth, $\alpha$ is a weight coefficient set to 1, $x$ denotes the output of the network, $L_{conf}(\cdot)$ denotes the classification loss and $L_{loc}(\cdot)$ denotes the regression loss.
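A schematic PyTorch rendering of this combined loss; the concrete criteria chosen below (cross-entropy for classification, smooth L1 for regression) are typical choices assumed for illustration and are not specified by the patent.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, loc_preds, loc_targets,
                   num_pos: int, alpha: float = 1.0) -> torch.Tensor:
    """L = (L_conf + alpha * L_loc) / N, averaged over the N positive samples."""
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")  # classification loss
    l_loc = F.smooth_l1_loss(loc_preds, loc_targets, reduction="sum")   # regression loss
    return (l_conf + alpha * l_loc) / max(num_pos, 1)
```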
In summary, three key frames are selected from the video at an interval of K frames and, after grayscale conversion, spliced as the R, G and B channels to form a three-channel space-time image containing motion information; the small DarkNet-19 is used as the feature extractor to obtain the features of the three-channel space-time image, and a fully connected layer is appended to output a preliminary proposal region; the longitudinal scan lines of the three-channel space-time image of the preliminary proposal region are stacked frame by frame into a two-dimensional matrix to obtain an STMap; the STMap is initialized, decomposed with an improved DMD algorithm, and the decomposition result is input into the space-time feature extractor to obtain a motion information fluctuation feature map; and the motion information fluctuation feature map is fed into the object detection network YOLOv5 as the base network for detection, post-processing is performed, and the behavior detection results of the students in the video are output. The computational cost of the network is effectively reduced and the accuracy of fine-grained behavior detection is improved.
In the description of the present invention, it should be understood that terms such as "coaxial", "bottom", "one end", "top", "middle", "another end", "upper", "one side", "inner", "outer", "front", "center" and "two ends" indicate orientations or positional relationships based on those shown in the drawings, and are used only to facilitate and simplify the description of the invention; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the invention.
In the present invention, unless otherwise explicitly specified and limited, terms such as "mounted", "configured", "connected", "secured" and "rotated" are to be construed broadly; for example, a connection may be fixed, detachable or integral; mechanical or electrical; direct or through an intermediary; or an internal communication or interaction between two elements. Unless explicitly defined otherwise, the specific meaning of these terms in the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A classroom behavior detection method based on space-time characteristics, comprising the following steps:
S1, selecting three key frames from the classroom video to be subjected to behavior detection, converting the key frames to grayscale, and splicing them as the R, G and B channels to obtain a three-channel space-time image containing motion information;
S2, taking the DarkNet-19 network as a feature extractor, extracting features of the three-channel space-time image at different scales through the repeated convolution and pooling operations of the DarkNet-19 network to remove irrelevant information, then compressing the extracted features into a one-dimensional vector, passing it to a fully connected layer, and obtaining a preliminary proposal region through a softmax function;
S3, stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain a space-time map STMap;
S4, initializing the space-time map STMap, decomposing it with an improved DMD algorithm, and inputting the decomposition result into the space-time feature extractor to obtain a motion information fluctuation feature map;
S5, feeding the motion information fluctuation feature map into the object detection network YOLOv5 as the base network for detection, performing post-processing, and outputting the behavior detection results of the students in the video.
2. The classroom behavior detection method based on space-time characteristics of claim 1, wherein the key frames comprise: frames of the video taken at an interval of K frames are set as key frames, where K = 0, 1, …, n.
3. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein the preliminary proposal region is obtained by the softmax function:
$D = \dfrac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$
where $D$ denotes the preliminary proposal region obtained by the softmax function, $z_i$ denotes the one-dimensional vector into which the features of different scales are compressed, and $c$ denotes the number of feature dimensions.
4. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein stacking the longitudinal scan lines of the preliminary proposal region frame by frame into a two-dimensional matrix to obtain the STMap comprises:
stacking the longitudinal scan lines, each of which represents the motion state of the target in the current key frame, frame by frame according to the order of the three key frames in the R, G and B channels to form a two-dimensional matrix $S_{n \times 3}$, where $n$ denotes the number of pixels per scan line and 3 denotes the 3 key frames in the three channels.
5. The classroom behavior detection method based on space-time characteristics of claim 1, wherein the space-time feature extractor comprises:
taking the UNet model as the base space-time feature extractor and replacing the original encoder in the UNet model with a lightweight encoding module to obtain an improved UNet model, which is used as the final space-time feature extractor.
6. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein step S4 specifically comprises:
S41: using a linear time-dependent operator A to describe the change of the scan-line pixels in the STMap, thereby extracting temporal information and obtaining the temporal features of the target;
S42: according to the temporal features of the target, representing the STMap as a combination of the eigenvectors and eigenvalues of the linear time-dependent operator A, searching for a low-order rank of A with the improved DMD algorithm, fitting the low-order rank to the dynamic trajectory of the target in the original video sequence, and decomposing the STMap into a low-rank background part and a sparse foreground part;
S43: inputting the low-rank background part and the sparse foreground part of the STMap into the improved UNet space-time feature extractor, performing downsampling with multi-layer convolutions to extract features, using correlation computation to establish the matching relations between the different features, performing several upsampling operations with the decoding module on the matched features to obtain the predicted optical flow corresponding to each feature, and fusing the output predicted optical flow with the features of the corresponding encoding layers to obtain semantic information from different layers, thereby obtaining the motion information fluctuation feature map.
7. The classroom behavior detection method based on space-time characteristics according to claim 1, wherein inputting the motion information fluctuation feature map into the network for detection and post-processing and outputting the behavior detection result of the target in the video comprises:
S51, inputting the obtained motion information fluctuation feature map into the convolutional neural network of the object detection network YOLOv5, generating on it a series of anchor boxes for locating and identifying targets, computing the confidence of each anchor box, setting a threshold, and filtering out anchor boxes whose confidence is below the threshold to obtain candidate boxes;
S52: further screening the candidate boxes by an NMS algorithm: selecting the predicted bounding box with the highest confidence among all candidate boxes as the reference and removing all other bounding boxes whose overlap (IoU) with it exceeds the preset threshold; then selecting the bounding box with the second-highest confidence among the remaining candidates as the reference and removing all other boxes whose overlap with it exceeds the preset threshold; repeating this operation until every remaining predicted box has served as the reference to obtain the final detection boxes; and judging the behavior information in the detection boxes to obtain the position of the target, the behavior category and the behavior start time.
8. The classroom behavior detection method based on space-time characteristics of claim 7, wherein the confidence is computed as:
$\mathrm{IOU} = \dfrac{\mathrm{area}(r_g \cap r_n)}{\mathrm{area}(r_g \cup r_n)}$
where IOU denotes the confidence, $\mathrm{area}(r_g)$ denotes the area of the predicted box and $\mathrm{area}(r_n)$ denotes the area of the ground-truth box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310306774.6A CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310306774.6A CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597503A (en) | 2023-08-15 |
Family
ID=87603295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310306774.6A Pending CN116597503A (en) | 2023-03-27 | 2023-03-27 | Classroom behavior detection method based on space-time characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597503A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118155294A (en) * | 2024-05-11 | 2024-06-07 | 武汉纺织大学 | Double-flow network classroom behavior identification method based on space-time attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||