CN117058752A - Student classroom behavior detection method based on improved YOLOv7 - Google Patents
- Publication number: CN117058752A (application CN202310884525.5A)
- Authority: CN (China)
- Prior art keywords: student, yolov7, image, features, behavior
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G06V10/26 — Segmentation of patterns in the image field; clustering-based techniques; detection of occlusion
- G06V10/462 — Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/764 — Recognition or understanding using classification, e.g. of video objects
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
A student classroom behavior detection method based on improved YOLOv7 belongs to the technical field of classroom behavior detection. First, the detection head is changed to an ASFFDetect structure, so that the YOLOv7 network model performs feature fusion across different feature levels, captures target information at different scales, and improves target localization ability. Second, the CIoU loss function in the original YOLOv7 network model is replaced with an NWD-based loss to adapt to unbalanced data and improve the generalization ability of the model. Finally, an ACmix attention module is added so that the network pays more attention to the objects to be detected, enhancing the feature-processing capability of the network. The improved YOLOv7 model provided by the application can effectively detect students' classroom behaviors under conditions of low image resolution, targets of different scales, and occlusion.
Description
Technical Field
The application relates to the technical field of classroom behavior detection, and in particular to a student classroom behavior detection method based on improved YOLOv7.
Background
With the development of the education industry, the field of education and teaching attaches increasing importance to classroom teaching, with particular attention to students' responses and behavioral changes in class. The new curriculum reform places higher demands on teaching evaluation. Meanwhile, in recent years China has been steadily advancing the construction of smart campuses, gradually building school models featuring intelligent teaching, intelligent management, intelligent life, and the like. The student classroom is one of the key links in building a smart campus, and its quality is influenced by many aspects, including instructional design, classroom practice, and teaching evaluation. Among these, evaluating teaching by observing students' classroom behavior is an effective and commonly used method.
In conventional teaching evaluation, an evaluating teacher typically sits in the back row to assess students' in-class state and the lecturer's teaching. However, because of the evaluating teacher's fixed position, it is difficult to observe the specific in-class state of every student comprehensively: only a few students can be assessed, resulting in incomplete evaluation data. In addition, different evaluating teachers differ in evaluation criteria, observation patterns, and perspectives, which also leads to differences in evaluation results. An evaluating teacher's mental state varies across different periods of the same class, and it is difficult to observe students' classroom behavior attentively for a long time, further increasing the variance of teaching evaluation. Therefore, detecting and analyzing students' in-class behavior from an objective perspective is of great significance to evaluating teachers, lecturers, school leaders, and parents. If computer technology can automatically recognize and detect students' classroom behavior, it can provide comprehensive and objective data for teaching evaluation and help improve teaching quality.
With the development of video analysis and computer vision technology, analyzing student behavior in classroom videos or images for teaching evaluation can provide more accurate and objective feedback. In the field of classroom behavior detection, common approaches include video-based action recognition, pose estimation, and object detection. Video action recognition faces the problem of processing large-scale, high-dimensional video data, requires large amounts of computing resources and memory, and actions in video have long-term dependencies that must be captured and modeled over time. For pose estimation, estimating the poses of multiple people simultaneously in a multi-person scene is challenging, and accuracy degrades when parts of the human body are occluded or poses change drastically. Time-series analysis requires long-term dependencies to be established to accommodate different behavioral patterns and contexts. Behavior recognition based on object detection, by contrast, can accurately localize target objects and can detect and recognize multiple targets simultaneously in complex scenes such as multi-person interaction and group behavior. Object detection has made remarkable progress in real-time applications and provides powerful support for behavior recognition tasks.
Classroom teaching videos contain numerous student targets and severe occlusion, which pose great research challenges for student behavior recognition in classroom scenes. To automatically recognize the classroom behavior of all students, a more robust multi-person behavior recognition model must be studied. Conventional object-detection-based methods for student classroom behavior detection are affected by factors such as numerous student targets, inconsistent target sizes, target occlusion, and low video or image resolution, so they cannot accurately and efficiently recognize students' in-class behavior states.
Disclosure of Invention
Aiming at the defects in the prior art, this application provides a student classroom behavior detection method based on improved YOLOv7. The method mainly improves modules of YOLOv7 such as the backbone network, the prediction head, and the IoU-based loss; the improved model focuses more on the objects to be detected, thereby improving behavior detection in student classroom scenes and solving the problems mentioned in the background above. Experimental results show that the method of this application outperforms the prior art.
In order to achieve the above purpose, the application adopts the following technical scheme. A student classroom behavior detection method based on improved YOLOv7 comprises the following steps:
Step 1, acquiring a video of student classroom behavior and extracting frames from the acquired video to obtain images of student classroom behavior;
Step 2, preprocessing the images obtained in step 1, labeling them with the labelImg image annotation tool, and dividing the data to obtain a student classroom behavior dataset;
Step 3, constructing a student classroom behavior detection network based on improved YOLOv7: adding an ACmix attention mechanism to the backbone network of the YOLOv7 algorithm, improving the prediction head part by replacing the Detect head in the original YOLOv7 algorithm with an ASFFDetect structure, and introducing an NWD-based regression loss as the loss function;
Step 4, taking the image data in the dataset as input and training the improved YOLOv7 model to obtain a trained student classroom behavior detection model;
Step 5, feeding the classroom scene images to be detected into the trained model to obtain the behavior categories and confidences of the students.
The image preprocessing and labeling in step 2 comprise the following steps:
Step 2.1, preprocessing the obtained student classroom behavior images with the OpenCV library, e.g. changing brightness and contrast, removing the background and parts of the image, smoothing, denoising, and fusing pictures;
Step 2.2, annotating the students' actions in the obtained images with the labelImg image annotation tool and storing the label information in a txt file with the same name as the picture, obtaining a student classroom behavior dataset;
Step 2.3, dividing the dataset into a training set and a test set: all pictures and their labels are split at a ratio of 8:2 into training and test sets.
The student classroom behavior detection network based on improved YOLOv7 mainly comprises four parts: Input, Backbone, Neck, and Head. An ACmix attention convolution module is introduced into the Neck of the basic YOLOv7 to highlight the key target features contained in the shallow network, weaken irrelevant information, improve the algorithm's detection of small targets, and make the network focus more on the targets to be detected. In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head; during training, an optimal fusion of the features of different layers is learned, filtering out features from other layers that carry contradictory information and thereby solving the problem of inconsistent learning targets. An NWD-based regression loss is introduced to replace CIoU in the original YOLOv7 network model, optimizing the loss function, adapting to unbalanced data, and improving the generalization ability of the model.
The ACmix attention convolution module introduced in the Neck can be roughly divided into three stages. First stage: the input features are projected by three 1×1 convolutions and then recombined into N blocks, yielding a feature map containing 3×N intermediate features. Second stage: the intermediate features are used according to two different paradigms. For the self-attention path, they are collected into N groups, where each group contains three features corresponding to q, k, and v. For the convolution path with kernel size K, a lightweight fully connected layer generates K² feature maps, and features are produced by shifting and aggregation. Third stage: the outputs of the two paths are added, their strengths controlled by two learnable scalars:
F_out = α·F_att + β·F_conv  (1)
where F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β are both set to 1. Combining the outputs of the two branches takes both global and local features into account, improving the network's detection of small targets.
In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head. The ASFF module comprises two steps: same-size transformation and adaptive feature fusion. Feature same-size transformation: the feature maps of different layers have inconsistent sizes, so whatever the fusion approach, they must be reshaped to the same size; going from a small size to a large one requires upsampling, and from a large size to a small one requires downsampling. Adaptive fusion: taking ASFF-3 as an example, the new fused feature ASFF-3 is obtained by multiplying the features X1, X2, X3 from level1, level2, level3 by the weight parameters α3, β3, and γ3 respectively and adding them together:
y_l^{ij} = α_l^{ij} · x_{1→l}^{ij} + β_l^{ij} · x_{2→l}^{ij} + γ_l^{ij} · x_{3→l}^{ij}  (2)
where y_l^{ij} denotes the (i, j) vector across channels of the output feature map y^l, and α_l^{ij}, β_l^{ij}, γ_l^{ij} denote the spatial importance weights of the feature maps from the three different levels with respect to level l. Since the fusion is performed by addition, the feature maps output by level1–level3 must have the same size and the same number of channels when added, which requires up- or down-sampling of the features from different layers and adjustment of the channel counts. The weight parameters α, β, and γ are obtained by 1×1 convolution of the resized features of level1–level3; after a concat layer and a softmax function, α, β, and γ all lie in [0, 1] and sum to 1:
α_l^{ij} + β_l^{ij} + γ_l^{ij} = 1,  α_l^{ij}, β_l^{ij}, γ_l^{ij} ∈ [0, 1]  (3)
The CIoU loss function of the original model is replaced by a loss designed from the NWD metric:
L_NWD = 1 − NWD(N_p, N_g), with NWD(N_p, N_g) = exp(−√(W₂²(N_p, N_g)) / C), where N_p is the Gaussian distribution model of the prediction box P, N_g is the Gaussian distribution model of the ground-truth box G, W₂² is the squared 2-Wasserstein distance between the two Gaussians, and C is a constant. The NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G.
The technical scheme of the application achieves the following effects. In the student classroom behavior detection method based on improved YOLOv7, adding the ACmix attention convolution module highlights the key target features contained in the shallow network, weakens irrelevant information, and makes the network pay more attention to the targets to be detected, addressing the problems of numerous student targets and target occlusion in classroom scenes. Replacing the Detect prediction head of the Head part in the original YOLOv7 model with the ASFFDetect prediction head and learning an optimal fusion of different-layer features during training filters out features from other layers that carry contradictory information, solving the problem of inconsistent learning targets and the large differences in target size found in classroom scenes. In addition, introducing an NWD-based regression loss to replace CIoU in the original YOLOv7 network model optimizes the loss function, adapts to unbalanced data, improves the generalization ability of the model, and addresses detection under the low image resolution common in classroom scenes.
Drawings
Fig. 1 is a flowchart of the student classroom behavior detection method based on improved YOLOv7.
Fig. 2 is the network model structure of the student classroom behavior detection method based on improved YOLOv7.
Fig. 3 is a diagram of detection results produced by the student classroom behavior detection method based on improved YOLOv7.
Detailed Description
The application is described in further detail below with reference to the attached drawings and specific embodiments. It is apparent that the described examples are only some, not all, of the embodiments of the application.
Fig. 1 shows the flowchart of the student classroom behavior detection method based on improved YOLOv7. The method specifically comprises the following steps:
Step 1, acquiring a video of student classroom behavior and extracting frames from the acquired video to obtain images of student classroom behavior.
A student classroom behavior video is acquired by downloading a classroom behavior dataset from the data source GitHub; the video is read, the resolution of the output images is set, and each frame is output in image format in sequence to obtain student classroom behavior images.
Step 2, preprocessing the images obtained in step 1, labeling them with the labelImg image annotation tool, and dividing the data to obtain a student classroom behavior dataset.
Step 2.1, preprocessing the student classroom behavior images with the OpenCV library: changing brightness and contrast, removing the background, smoothing parts of the images, denoising, and fusing pictures.
Step 2.2, annotating the students' actions in the obtained images with the labelImg tool and storing the label information in a txt file with the same name as the picture, obtaining a student classroom behavior dataset.
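The label format can be illustrated as follows — a minimal sketch, assuming labelImg is used in its YOLO export mode, where each line of the txt file is `class_id x_center y_center width height` with coordinates normalized to [0, 1] (the patent does not specify the exact format, so the class id and box values here are hypothetical):

```python
# Hypothetical example: parsing one labelImg YOLO-format annotation line.
# Each line is "class_id x_center y_center width height", normalized to [0, 1].

def parse_yolo_label(line, img_w, img_h):
    """Convert one normalized YOLO label line to pixel-space (class_id, x1, y1, x2, y2)."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x1 = (xc - w / 2) * img_w   # left edge in pixels
    y1 = (yc - h / 2) * img_h   # top edge in pixels
    x2 = (xc + w / 2) * img_w   # right edge in pixels
    y2 = (yc + h / 2) * img_h   # bottom edge in pixels
    return cls, x1, y1, x2, y2

# Example: a behavior box (class 0) centered in a 640x480 frame.
label = parse_yolo_label("0 0.5 0.5 0.25 0.5", 640, 480)
```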
Step 2.3, dividing the dataset into a training set and a test set: all pictures and their labels are split at a ratio of 8:2 into training and test sets.
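The 8:2 split in step 2.3 can be sketched with the Python standard library (the file names and the fixed random seed are illustrative assumptions):

```python
# A minimal sketch of the 8:2 train/test split described in step 2.3.
import random

def split_dataset(image_names, train_ratio=0.8, seed=42):
    """Shuffle image names deterministically and split them at train_ratio."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * train_ratio)
    return names[:cut], names[cut:]

# Hypothetical frame names extracted in step 1.
images = [f"frame_{i:04d}.jpg" for i in range(100)]
train_set, test_set = split_dataset(images)
```

Each image's identically named txt label file would be moved alongside it into the corresponding split.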
step 3, constructing a student class behavior detection network based on improved YOLOv7, adding an ACmix attention mechanism in a main network of a YOLOv7 algorithm, improving a prediction head part in the YOLOv7 algorithm, replacing a detection in the original YOLOv7 algorithm with an ASFFdetection structure, and introducing NWD-based Regression Loss as a loss function;
the student class behavior detection network based on the improved YOLOv7 is constructed, and specifically comprises an attention adding convolution module, a change prediction head and a replacement loss function:
The student classroom behavior detection network based on improved YOLOv7 mainly comprises four parts: Input, Backbone, Neck, and Head. An ACmix attention convolution module is introduced into the Neck of the basic YOLOv7 to highlight the key target features contained in the shallow network, weaken irrelevant information, improve the algorithm's detection of small targets, and make the network focus more on the targets to be detected. In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head; during training, an optimal fusion of the features of different layers is learned, filtering out features from other layers that carry contradictory information and thereby solving the problem of inconsistent learning targets. An NWD-based regression loss is introduced to replace CIoU in the original YOLOv7 network model, optimizing the loss function, adapting to unbalanced data, and improving the generalization ability of the model.
The ACmix attention convolution module introduced in the Neck can be roughly divided into three stages. First stage: the input features are projected by three 1×1 convolutions and then recombined into N blocks, yielding a feature map containing 3×N intermediate features. Second stage: the intermediate features are used according to two different paradigms. For the self-attention path, they are collected into N groups, where each group contains three features corresponding to q, k, and v. For the convolution path with kernel size K, a lightweight fully connected layer generates K² feature maps, and features are produced by shifting and aggregation. Third stage: the outputs of the two paths are added, their strengths controlled by two learnable scalars:
F_out = α·F_att + β·F_conv  (1)
where F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β are both set to 1. Combining the outputs of the two branches takes both global and local features into account, improving the network's detection of small targets.
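Equation (1) can be sketched numerically as follows — a toy NumPy illustration with stand-in feature maps; the real ACmix module additionally shares the 1×1-convolution projections between the two paths, which is omitted here:

```python
# A sketch of Eq. (1): combining the self-attention and convolution branch
# outputs with two learnable scalars alpha and beta (both set to 1).
import numpy as np

def acmix_combine(f_att, f_conv, alpha=1.0, beta=1.0):
    """F_out = alpha * F_att + beta * F_conv."""
    return alpha * f_att + beta * f_conv

f_att = np.ones((4, 4))         # stand-in output of the self-attention path
f_conv = 2.0 * np.ones((4, 4))  # stand-in output of the convolution path
f_out = acmix_combine(f_att, f_conv)
```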
In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head. The ASFF module comprises two steps: same-size transformation and adaptive feature fusion. Feature same-size transformation: the feature maps of different layers have inconsistent sizes, so whatever the fusion approach, they must be reshaped to the same size; going from a small size to a large one requires upsampling, and from a large size to a small one requires downsampling. Adaptive fusion: taking ASFF-3 as an example, the new fused feature ASFF-3 is obtained by multiplying the features X1, X2, X3 from level1, level2, level3 by the weight parameters α3, β3, and γ3 respectively and adding them together:
y_l^{ij} = α_l^{ij} · x_{1→l}^{ij} + β_l^{ij} · x_{2→l}^{ij} + γ_l^{ij} · x_{3→l}^{ij}  (2)
where y_l^{ij} denotes the (i, j) vector across channels of the output feature map y^l, and α_l^{ij}, β_l^{ij}, γ_l^{ij} denote the spatial importance weights of the feature maps from the three different levels with respect to level l. Since the fusion is performed by addition, the feature maps output by level1–level3 must have the same size and the same number of channels when added, which requires up- or down-sampling of the features from different layers and adjustment of the channel counts. The weight parameters α, β, and γ are obtained by 1×1 convolution of the resized features of level1–level3; after a concat layer and a softmax function, α, β, and γ all lie in [0, 1] and sum to 1:
α_l^{ij} + β_l^{ij} + γ_l^{ij} = 1,  α_l^{ij}, β_l^{ij}, γ_l^{ij} ∈ [0, 1]  (3)
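The adaptive fusion step can be sketched as follows — a NumPy toy example in which the 1×1 convolutions that produce the weight logits are replaced by random values for illustration:

```python
# A sketch of ASFF adaptive fusion: per-pixel weights for the three resized
# level features come from a softmax (so they lie in [0, 1] and sum to 1),
# then feed a weighted sum.
import numpy as np

def asff_fuse(x1, x2, x3, logits):
    """Fuse three same-sized feature maps with softmax weights alpha, beta, gamma."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable softmax
    w = e / e.sum(axis=0, keepdims=True)                    # shape (3, H, W), sums to 1
    return w[0] * x1 + w[1] * x2 + w[2] * x3, w

rng = np.random.default_rng(0)
h, w = 8, 8
x1, x2, x3 = (rng.standard_normal((h, w)) for _ in range(3))  # stand-in resized level features
fused, weights = asff_fuse(x1, x2, x3, rng.standard_normal((3, h, w)))
```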
The CIoU loss function of the original model is replaced by a loss designed from the NWD metric:
L_NWD = 1 − NWD(N_p, N_g), with NWD(N_p, N_g) = exp(−√(W₂²(N_p, N_g)) / C), where N_p is the Gaussian distribution model of the prediction box P, N_g is the Gaussian distribution model of the ground-truth box G, W₂² is the squared 2-Wasserstein distance between the two Gaussians, and C is a constant. The NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G.
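The NWD-based loss can be sketched as follows — assuming the standard normalized-Wasserstein formulation, in which each box (cx, cy, w, h) is modeled as a 2-D Gaussian and the squared 2-Wasserstein distance between the two Gaussians has a closed form; the constant C = 12.8 is an illustrative value, not taken from the patent:

```python
# A sketch of the NWD-based regression loss: 1 - exp(-sqrt(W2^2) / C).
import math

def wasserstein2(box_p, box_g):
    """Closed-form squared 2-Wasserstein distance between box Gaussians
    N(mu=(cx, cy), Sigma=diag(w^2/4, h^2/4))."""
    cxp, cyp, wp, hp = box_p
    cxg, cyg, wg, hg = box_g
    return ((cxp - cxg) ** 2 + (cyp - cyg) ** 2
            + (wp / 2 - wg / 2) ** 2 + (hp / 2 - hg / 2) ** 2)

def nwd_loss(box_p, box_g, c=12.8):
    """NWD loss: zero for identical boxes, approaches 1 for distant boxes."""
    return 1.0 - math.exp(-math.sqrt(wasserstein2(box_p, box_g)) / c)

# Identical boxes give zero loss; a fully disjoint pair still yields a
# finite, informative value (unlike IoU-style losses when |P ∩ G| = 0).
same = nwd_loss((10, 10, 4, 4), (10, 10, 4, 4))
far = nwd_loss((10, 10, 4, 4), (50, 50, 4, 4))
```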
Step 4, taking the image data in the dataset as input and training the improved YOLOv7 model to obtain a trained student classroom behavior detection model.
The image data in the student classroom behavior dataset are fed into the improved YOLOv7 model for training. The training parameters are set with a learning rate of 0.001 and a confidence threshold of 0.5; all pictures in the training set are input into the improved YOLOv7 model, and training is repeated to obtain the model with the best performance.
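The training setup described above can be summarized as a small configuration sketch; values not stated in the patent (epochs, batch size, image size) are marked as assumptions:

```python
# Training hyperparameters for the improved YOLOv7 model.
train_config = {
    "lr": 0.001,         # learning rate (stated in the patent)
    "conf_thres": 0.5,   # confidence threshold (stated in the patent)
    "train_ratio": 0.8,  # 8:2 train/test split (stated in the patent)
    "epochs": 100,       # assumed, not stated in the patent
    "batch_size": 16,    # assumed, not stated in the patent
    "img_size": 640,     # assumed, not stated in the patent
}
```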
Step 5, feeding the classroom scene images to be detected into the trained model to obtain the behavior categories and confidences of the students.
Students' classroom behavior is detected using the trained detection network based on improved YOLOv7.
Fig. 2 shows the network model structure of the student classroom behavior detection method based on improved YOLOv7. As shown in the figure, the method designs a new network structure for student classroom behavior detection: on the basis of the YOLOv7 network, an ACmix attention convolution module is added, the prediction head is changed to ASFFDetect, and the original loss function is replaced by an NWD-based regression loss. Experimental results show that the method has advantages in accuracy and real-time performance over the prior art.
Taking a group of student classroom behavior images as input, different input images are detected through step 5 to obtain student classroom behavior detection images. Fig. 3 shows the detection results for this group of pictures; it can be seen from Fig. 3 that the method accurately detects students' behaviors in a multi-target, occluded classroom scene, demonstrating its feasibility and effectiveness.
Claims (2)
1. A student classroom behavior detection method based on improved YOLOv7, characterized by comprising the following steps:
step 1, acquiring a video of student classroom behavior, and frame-removing the acquired video to obtain a picture of the student classroom behavior;
step 2, preprocessing the image obtained in the step 1, marking a student class behavior data set by using a labelImg image marking tool, and dividing the data set to obtain the student class behavior data set;
step 3, constructing a student classroom behavior detection network based on the improved YOLOv7: adding an ACmix attention convolution module to the YOLOv7 backbone network; improving the prediction head part of the YOLOv7 algorithm by replacing the Detect structure in the original YOLOv7 algorithm with an ASFFDetect structure, which learns the features of different levels through an optimal fusion method during training and filters out features from other levels that carry contradictory information; and simultaneously introducing an NWD-based regression loss as the loss function;
the student classroom behavior detection network based on the improved YOLOv7 mainly comprises Input, Backbone, Neck and Head parts; the ACmix attention convolution module introduced in the Neck part operates as follows:
the first stage: the input features are projected through three 1×1 convolutions and then reshaped into N blocks, yielding a feature map comprising 3×N intermediate features;
the second stage: the intermediate features are used according to different paradigms; for the self-attention path, the intermediate features are gathered into N groups, each group containing three features corresponding to the query q, key k and value v; for the convolution path with kernel size K, K² feature maps are generated by a lightweight fully-connected layer, and the features are produced through shift and aggregation operations;
the third stage: the outputs of the two paths are added, with the strength of each controlled by two learnable scalars:
F_out = α·F_att + β·F_conv        (1)
wherein F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β both take the value 1;
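The two-path combination of formula (1) can be sketched numerically as follows. This is a minimal NumPy illustration, not the patented implementation; the feature-map shapes and values are hypothetical:

```python
import numpy as np

def acmix_combine(f_att, f_conv, alpha=1.0, beta=1.0):
    """Combine the self-attention and convolution branch outputs:
    F_out = alpha * F_att + beta * F_conv."""
    return alpha * f_att + beta * f_conv

# Hypothetical branch outputs with shape (channels, height, width)
f_att = np.full((8, 4, 4), 0.5)
f_conv = np.full((8, 4, 4), 0.25)
f_out = acmix_combine(f_att, f_conv)  # alpha = beta = 1, as in the claim
```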
in the Head part, the Detect prediction head in the original network is replaced by an ASFFDetect prediction head; the ASFFDetect module comprises two steps: same-size transformation and adaptive feature fusion;
(1) Feature same-size transformation: the feature map sizes of different levels are inconsistent, so the feature maps are reshaped to the same size; up-sampling is needed when converting from a small size to a large size, and down-sampling when converting from a large size to a small size;
(2) Adaptive fusion: the features X1, X2 and X3 from levels 1, 2 and 3 are respectively multiplied by the weight parameters α, β and γ and summed to obtain the new fused feature ASFF-3:

y_ij^l = α_ij^l · x_ij^(1→l) + β_ij^l · x_ij^(2→l) + γ_ij^l · x_ij^(3→l)        (2)

wherein y_ij^l denotes the (i, j) vector of the output feature map y^l across channels, and α_ij^l, β_ij^l and γ_ij^l denote the spatial importance weights of the feature maps from the three different levels to level l; because an addition mode is adopted, the features of different levels need to be up-sampled or down-sampled and their channel numbers adjusted, so that the output features of levels 1-3 have the same size and the same number of channels;
the weight parameters α, β and γ are obtained by applying a 1×1 convolution to the resized feature maps of level1 to level3; after passing through the concat layer, the weight parameters α, β and γ are constrained to lie within [0, 1] and to sum to 1 by a softmax function:

α_ij^l = e^(λ_α,ij^l) / (e^(λ_α,ij^l) + e^(λ_β,ij^l) + e^(λ_γ,ij^l))        (3)

wherein λ_α,ij^l, λ_β,ij^l and λ_γ,ij^l are the control parameters produced by the 1×1 convolution; to replace the loss function of the original model, a loss function based on the NWD measurement is designed:

L_NWD = 1 − NWD(N_p, N_g)        (4)
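The adaptive fusion step can be sketched with NumPy as follows, assuming the three level features have already been resized to a common shape. The shapes and weight logits here are illustrative, not values from the patented network:

```python
import numpy as np

def asff_fuse(x1, x2, x3, logits):
    """Fuse three same-size level features with softmax-normalized
    spatial weights alpha, beta, gamma (summing to 1 at each position)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable softmax over the 3 levels
    alpha, beta, gamma = e / e.sum(axis=0, keepdims=True)
    return alpha * x1 + beta * x2 + gamma * x3

h = w = 4
x1, x2, x3 = (np.full((h, w), v) for v in (1.0, 2.0, 3.0))
logits = np.zeros((3, h, w))           # equal logits -> weights of 1/3 each
fused = asff_fuse(x1, x2, x3, logits)  # every position becomes (1 + 2 + 3) / 3 = 2.0
```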
wherein N_p is the Gaussian distribution model of the prediction box P, and N_g is the Gaussian distribution model of the GT box G; the NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G;
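The NWD measurement can be illustrated as follows. This sketch follows the common formulation in which a box (cx, cy, w, h) is modeled as a 2-D Gaussian N((cx, cy), diag(w²/4, h²/4)); the normalizing constant C is dataset-dependent, and the value used here is only an assumption:

```python
import math

def nwd_loss(box_p, box_g, C=12.8):
    """NWD-based regression loss between a predicted box and a GT box,
    each given as (cx, cy, w, h) and modeled as a 2-D Gaussian.
    For axis-aligned Gaussians, the 2-Wasserstein distance reduces to the
    Euclidean distance between the (cx, cy, w/2, h/2) vectors."""
    p = (box_p[0], box_p[1], box_p[2] / 2, box_p[3] / 2)
    g = (box_g[0], box_g[1], box_g[2] / 2, box_g[3] / 2)
    w2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
    nwd = math.exp(-w2 / C)   # normalized Wasserstein distance in (0, 1]
    return 1.0 - nwd          # non-zero loss even when the boxes do not overlap

loss_same = nwd_loss((10, 10, 4, 4), (10, 10, 4, 4))  # identical boxes -> loss 0
loss_far = nwd_loss((10, 10, 4, 4), (40, 40, 4, 4))   # disjoint boxes still yield a finite, smooth loss
```

Unlike IoU-based losses, the distance remains informative for non-overlapping boxes, which is why the claim notes that gradients exist even when |P ∩ G| = 0.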
step 4, taking the image data in the data set as input and inputting it into the improved YOLOv7 model for training to obtain the trained student classroom behavior detection model;
and step 5, sending the student classroom scene image to be detected into the trained model to obtain the behavior category and confidence of the student.
2. The student classroom behavior detection method based on the improved YOLOv7 according to claim 1, wherein the image preprocessing and image labeling of step 2 comprise the following steps:
step 2.1, preprocessing the obtained student classroom behavior images using the OpenCV library: changing brightness and contrast, removing the background, smoothing local regions of the images, reducing noise, and fusing pictures to obtain the student classroom behavior images;
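The brightness/contrast adjustment of step 2.1 can be sketched without OpenCV as a linear pixel transform clipped to the valid range. The gain and bias values below are illustrative assumptions; the patent itself uses the OpenCV library:

```python
import numpy as np

def adjust_brightness_contrast(img, gain=1.2, bias=10):
    """Linear brightness/contrast transform out = gain * img + bias,
    clipped to the valid uint8 range [0, 255]."""
    return np.clip(gain * img.astype(np.float32) + bias, 0, 255).astype(np.uint8)

img = np.full((2, 2, 3), 200, dtype=np.uint8)  # a tiny hypothetical image
out = adjust_brightness_contrast(img)          # 1.2 * 200 + 10 = 250 per channel
```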
step 2.2, labeling the actions of the students in the obtained student classroom behavior images using the labelImg image labeling tool, and storing the label information in a txt file with the same name as the picture to obtain the student classroom behavior data set;
and step 2.3, dividing the student classroom behavior image data set into a training data set and a test data set: all pictures and their labels are divided into the training set and the test set in a ratio of 8:2.
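The 8:2 split of step 2.3 can be sketched as follows. The sample names are placeholders; a real pipeline would move each image together with its same-named txt label file:

```python
import random

def split_dataset(names, train_ratio=0.8, seed=0):
    """Shuffle sample names and split them into train/test lists at the given ratio."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = names[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

names = [f"img_{i:03d}" for i in range(100)]  # hypothetical image stems
train, test = split_dataset(names)            # 80 train / 20 test
```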
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310884525.5A CN117058752A (en) | 2023-07-19 | 2023-07-19 | Student classroom behavior detection method based on improved YOLOv7 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117058752A true CN117058752A (en) | 2023-11-14 |
Family
ID=88661621
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117611998A (en) * | 2023-11-22 | 2024-02-27 | 盐城工学院 | Optical remote sensing image target detection method based on improved YOLOv7 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||