CN117058752A - Student classroom behavior detection method based on improved YOLOv7 - Google Patents
- Publication number: CN117058752A (application CN202310884525.5A)
- Authority: CN (China)
- Prior art keywords: student, yolov7, image, features, behavior
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- G06V10/26 — Segmentation of patterns in the image field; clustering-based techniques; detection of occlusion
- G06V10/462 — Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/764 — Recognition or understanding using classification, e.g. of video objects
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Recognition or understanding using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
A student classroom behavior detection method based on improved YOLOv7 belongs to the technical field of classroom behavior detection. First, the detection head is changed to an ASFFDetect structure, so that the YOLOv7 network model performs feature fusion across different feature levels, captures target information at different scales, and improves target localization ability. Second, the CIoU loss function in the original YOLOv7 network model is replaced with an NWD-based loss to adapt to unbalanced data and improve the generalization ability of the model. Finally, an ACmix attention module is added so that the network pays more attention to the objects to be detected, enhancing the feature-processing capability of the network. The improved YOLOv7 model provided by the application can effectively detect students' classroom behaviors under conditions of low image resolution, targets of different scales, and occlusion.
Description
Technical Field
The application relates to the technical field of classroom behavior detection, and in particular to a student classroom behavior detection method based on improved YOLOv7.
Background
With the development of the education industry, the field of education and teaching attaches increasing importance to classroom teaching, with particular attention to students' responses and behavioral changes in class. The new curriculum reform places higher demands on teaching evaluation. Meanwhile, in recent years China has been steadily advancing the construction of smart campuses, gradually building school models featuring intelligent teaching, intelligent management, intelligent life, and the like. The student classroom is one of the key links in building a smart campus, and its quality is influenced by many aspects, including instructional design, classroom practice, and teaching evaluation. Among these, evaluating teaching by observing students' classroom behavior is an effective and commonly used method.
In conventional teaching evaluation, an evaluating teacher typically sits in the back row to assess students' in-class state and the lecturer's teaching. However, because of the evaluating teacher's fixed position, it is difficult to observe the specific in-class state of every student comprehensively: only a few students can be assessed, resulting in incomplete evaluation data. In addition, different evaluating teachers differ in evaluation criteria, observation patterns, and perspectives, which also leads to differences in evaluation results. An evaluating teacher's mental state varies across different periods of the same class, and it is difficult to observe students' classroom behavior attentively for a long time, further increasing the variance of teaching evaluation. Therefore, detecting and analyzing students' in-class behavior from an objective perspective is of great significance to evaluating teachers, lecturers, school leaders, and parents. If computer technology can automatically recognize and detect students' classroom behavior, it can provide comprehensive and objective data for teaching evaluation and help improve teaching quality.
With the development of video analysis and computer vision technology, analyzing student behavior in classroom videos or images for teaching evaluation can provide more accurate and objective feedback. In the field of classroom behavior detection, common approaches include video-based action recognition, pose estimation, and object detection. Video action recognition faces the problem of processing large-scale, high-dimensional video data, requires large amounts of computing resources and memory, and actions in video have long-term dependencies that must be captured and modeled over time. For pose estimation, estimating the poses of multiple people simultaneously in a multi-person scene is challenging, and accuracy degrades when parts of the human body are occluded or poses change drastically. Time-series analysis requires long-term dependencies to be established to accommodate different behavioral patterns and contexts. Behavior recognition based on object detection, by contrast, can accurately localize target objects and can detect and recognize multiple targets simultaneously in complex scenes such as multi-person interaction and group behavior. Object detection has made remarkable progress in real-time applications and provides powerful support for behavior recognition tasks.
Classroom teaching videos contain numerous student targets and severe occlusion, which pose great research challenges for student behavior recognition in classroom scenes. To automatically recognize the classroom behavior of all students, a more robust multi-person behavior recognition model must be studied. Conventional object-detection-based methods for student classroom behavior detection are affected by factors such as numerous student targets, inconsistent target sizes, target occlusion, and low video or image resolution, so they cannot accurately and efficiently recognize students' in-class behavior states.
Disclosure of Invention
Aiming at the defects in the prior art, this application provides a student classroom behavior detection method based on improved YOLOv7. The method mainly improves modules of YOLOv7 such as the backbone network, the prediction head, and the IoU-based loss; the improved model focuses more on the objects to be detected, thereby improving behavior detection in student classroom scenes and solving the problems mentioned in the background above. Experimental results show that the method of this application outperforms the prior art.
In order to achieve the above purpose, the application adopts the following technical scheme. A student classroom behavior detection method based on improved YOLOv7 comprises the following steps:
Step 1, acquiring a video of student classroom behavior and extracting frames from the acquired video to obtain images of student classroom behavior;
Step 2, preprocessing the images obtained in step 1, labeling them with the labelImg image annotation tool, and dividing the data to obtain a student classroom behavior dataset;
Step 3, constructing a student classroom behavior detection network based on improved YOLOv7: adding an ACmix attention mechanism to the backbone network of the YOLOv7 algorithm, improving the prediction head part by replacing the Detect head in the original YOLOv7 algorithm with an ASFFDetect structure, and introducing an NWD-based regression loss as the loss function;
Step 4, taking the image data in the dataset as input and training the improved YOLOv7 model to obtain a trained student classroom behavior detection model;
Step 5, feeding the classroom scene images to be detected into the trained model to obtain the behavior categories and confidences of the students.
The image preprocessing and labeling in step 2 comprise the following steps:
Step 2.1, preprocessing the obtained student classroom behavior images with the OpenCV library, e.g. changing brightness and contrast, removing the background and parts of the image, smoothing, denoising, and fusing pictures;
Step 2.2, annotating the students' actions in the obtained images with the labelImg image annotation tool and storing the label information in a txt file with the same name as the picture, obtaining a student classroom behavior dataset;
Step 2.3, dividing the dataset into a training set and a test set: all pictures and their labels are split at a ratio of 8:2 into training and test sets.
The student classroom behavior detection network based on improved YOLOv7 mainly comprises four parts: Input, Backbone, Neck, and Head. An ACmix attention convolution module is introduced into the Neck of the basic YOLOv7 to highlight the key target features contained in the shallow network, weaken irrelevant information, improve the algorithm's detection of small targets, and make the network focus more on the targets to be detected. In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head; during training, an optimal fusion of the features of different layers is learned, filtering out features from other layers that carry contradictory information and thereby solving the problem of inconsistent learning targets. An NWD-based regression loss is introduced to replace CIoU in the original YOLOv7 network model, optimizing the loss function, adapting to unbalanced data, and improving the generalization ability of the model.
The ACmix attention convolution module introduced in the Neck can be roughly divided into three stages. First stage: the input features are projected by three 1×1 convolutions and then recombined into N blocks, yielding a feature map containing 3×N intermediate features. Second stage: the intermediate features are used according to two different paradigms. For the self-attention path, they are collected into N groups, where each group contains three features corresponding to q, k, and v. For the convolution path with kernel size K, a lightweight fully connected layer generates K² feature maps, and features are produced by shifting and aggregation. Third stage: the outputs of the two paths are added, their strengths controlled by two learnable scalars:
F_out = α·F_att + β·F_conv  (1)
where F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β are both set to 1. Combining the outputs of the two branches takes both global and local features into account, improving the network's detection of small targets.
In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head. The ASFF module comprises two steps: same-size transformation and adaptive feature fusion. Feature same-size transformation: the feature maps of different layers have inconsistent sizes, so whatever the fusion approach, they must be reshaped to the same size; going from a small size to a large one requires upsampling, and from a large size to a small one requires downsampling. Adaptive fusion: taking ASFF-3 as an example, the new fused feature ASFF-3 is obtained by multiplying the features X1, X2, X3 from level1, level2, level3 by the weight parameters α3, β3, and γ3 respectively and adding them together:
y_l^{ij} = α_l^{ij} · x_{1→l}^{ij} + β_l^{ij} · x_{2→l}^{ij} + γ_l^{ij} · x_{3→l}^{ij}  (2)
where y_l^{ij} denotes the (i, j) vector across channels of the output feature map y^l, and α_l^{ij}, β_l^{ij}, γ_l^{ij} denote the spatial importance weights of the feature maps from the three different levels with respect to level l. Since the fusion is performed by addition, the feature maps output by level1–level3 must have the same size and the same number of channels when added, which requires up- or down-sampling of the features from different layers and adjustment of the channel counts. The weight parameters α, β, and γ are obtained by 1×1 convolution of the resized features of level1–level3; after a concat layer and a softmax function, α, β, and γ all lie in [0, 1] and sum to 1:
α_l^{ij} + β_l^{ij} + γ_l^{ij} = 1,  α_l^{ij}, β_l^{ij}, γ_l^{ij} ∈ [0, 1]  (3)
The CIoU loss function of the original model is replaced by a loss designed from the NWD metric:
L_NWD = 1 − NWD(N_p, N_g), with NWD(N_p, N_g) = exp(−√(W₂²(N_p, N_g)) / C), where N_p is the Gaussian distribution model of the prediction box P, N_g is the Gaussian distribution model of the ground-truth box G, W₂² is the squared 2-Wasserstein distance between the two Gaussians, and C is a constant. The NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G.
The technical scheme of the application achieves the following effects. In the student classroom behavior detection method based on improved YOLOv7, adding the ACmix attention convolution module highlights the key target features contained in the shallow network, weakens irrelevant information, and makes the network pay more attention to the targets to be detected, addressing the problems of numerous student targets and target occlusion in classroom scenes. Replacing the Detect prediction head of the Head part in the original YOLOv7 model with the ASFFDetect prediction head and learning an optimal fusion of different-layer features during training filters out features from other layers that carry contradictory information, solving the problem of inconsistent learning targets and the large differences in target size found in classroom scenes. In addition, introducing an NWD-based regression loss to replace CIoU in the original YOLOv7 network model optimizes the loss function, adapts to unbalanced data, improves the generalization ability of the model, and addresses detection under the low image resolution common in classroom scenes.
Drawings
Fig. 1 is a flowchart of the student classroom behavior detection method based on improved YOLOv7.
Fig. 2 is the network model structure of the student classroom behavior detection method based on improved YOLOv7.
Fig. 3 is a diagram of detection results produced by the student classroom behavior detection method based on improved YOLOv7.
Detailed Description
The application is described in further detail below with reference to the attached drawings and specific embodiments. It is apparent that the described examples are only some, not all, of the embodiments of the application.
Fig. 1 shows the flowchart of the student classroom behavior detection method based on improved YOLOv7. The method specifically comprises the following steps:
Step 1, acquiring a video of student classroom behavior and extracting frames from the acquired video to obtain images of student classroom behavior.
A student classroom behavior video is acquired by downloading a classroom behavior dataset from the data source GitHub; the video is read, the resolution of the output images is set, and each frame is output in image format in sequence to obtain student classroom behavior images.
Step 2, preprocessing the images obtained in step 1, labeling them with the labelImg image annotation tool, and dividing the data to obtain a student classroom behavior dataset.
Step 2.1, preprocessing the student classroom behavior images with the OpenCV library: changing brightness and contrast, removing the background, smoothing parts of the images, denoising, and fusing pictures.
Step 2.2, annotating the students' actions in the obtained images with the labelImg tool and storing the label information in a txt file with the same name as the picture, obtaining a student classroom behavior dataset.
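The label format can be illustrated as follows — a minimal sketch, assuming labelImg is used in its YOLO export mode, where each line of the txt file is `class_id x_center y_center width height` with coordinates normalized to [0, 1] (the patent does not specify the exact format, so the class id and box values here are hypothetical):

```python
# Hypothetical example: parsing one labelImg YOLO-format annotation line.
# Each line is "class_id x_center y_center width height", normalized to [0, 1].

def parse_yolo_label(line, img_w, img_h):
    """Convert one normalized YOLO label line to pixel-space (class_id, x1, y1, x2, y2)."""
    parts = line.split()
    cls = int(parts[0])
    xc, yc, w, h = (float(v) for v in parts[1:5])
    x1 = (xc - w / 2) * img_w   # left edge in pixels
    y1 = (yc - h / 2) * img_h   # top edge in pixels
    x2 = (xc + w / 2) * img_w   # right edge in pixels
    y2 = (yc + h / 2) * img_h   # bottom edge in pixels
    return cls, x1, y1, x2, y2

# Example: a behavior box (class 0) centered in a 640x480 frame.
label = parse_yolo_label("0 0.5 0.5 0.25 0.5", 640, 480)
```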
Step 2.3, dividing the dataset into a training set and a test set: all pictures and their labels are split at a ratio of 8:2 into training and test sets.
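The 8:2 split in step 2.3 can be sketched with the Python standard library (the file names and the fixed random seed are illustrative assumptions):

```python
# A minimal sketch of the 8:2 train/test split described in step 2.3.
import random

def split_dataset(image_names, train_ratio=0.8, seed=42):
    """Shuffle image names deterministically and split them at train_ratio."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    cut = int(len(names) * train_ratio)
    return names[:cut], names[cut:]

# Hypothetical frame names extracted in step 1.
images = [f"frame_{i:04d}.jpg" for i in range(100)]
train_set, test_set = split_dataset(images)
```

Each image's identically named txt label file would be moved alongside it into the corresponding split.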
step 3, constructing a student class behavior detection network based on improved YOLOv7, adding an ACmix attention mechanism in a main network of a YOLOv7 algorithm, improving a prediction head part in the YOLOv7 algorithm, replacing a detection in the original YOLOv7 algorithm with an ASFFdetection structure, and introducing NWD-based Regression Loss as a loss function;
the student class behavior detection network based on the improved YOLOv7 is constructed, and specifically comprises an attention adding convolution module, a change prediction head and a replacement loss function:
The student classroom behavior detection network based on improved YOLOv7 mainly comprises four parts: Input, Backbone, Neck, and Head. An ACmix attention convolution module is introduced into the Neck of the basic YOLOv7 to highlight the key target features contained in the shallow network, weaken irrelevant information, improve the algorithm's detection of small targets, and make the network focus more on the targets to be detected. In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head; during training, an optimal fusion of the features of different layers is learned, filtering out features from other layers that carry contradictory information and thereby solving the problem of inconsistent learning targets. An NWD-based regression loss is introduced to replace CIoU in the original YOLOv7 network model, optimizing the loss function, adapting to unbalanced data, and improving the generalization ability of the model.
The ACmix attention convolution module introduced in the Neck can be roughly divided into three stages. First stage: the input features are projected by three 1×1 convolutions and then recombined into N blocks, yielding a feature map containing 3×N intermediate features. Second stage: the intermediate features are used according to two different paradigms. For the self-attention path, they are collected into N groups, where each group contains three features corresponding to q, k, and v. For the convolution path with kernel size K, a lightweight fully connected layer generates K² feature maps, and features are produced by shifting and aggregation. Third stage: the outputs of the two paths are added, their strengths controlled by two learnable scalars:
F_out = α·F_att + β·F_conv  (1)
where F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β are both set to 1. Combining the outputs of the two branches takes both global and local features into account, improving the network's detection of small targets.
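Equation (1) can be sketched numerically as follows — a toy NumPy illustration with stand-in feature maps; the real ACmix module additionally shares the 1×1-convolution projections between the two paths, which is omitted here:

```python
# A sketch of Eq. (1): combining the self-attention and convolution branch
# outputs with two learnable scalars alpha and beta (both set to 1).
import numpy as np

def acmix_combine(f_att, f_conv, alpha=1.0, beta=1.0):
    """F_out = alpha * F_att + beta * F_conv."""
    return alpha * f_att + beta * f_conv

f_att = np.ones((4, 4))         # stand-in output of the self-attention path
f_conv = 2.0 * np.ones((4, 4))  # stand-in output of the convolution path
f_out = acmix_combine(f_att, f_conv)
```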
In the Head, the Detect prediction head of the original network is replaced with an ASFFDetect prediction head. The ASFF module comprises two steps: same-size transformation and adaptive feature fusion. Feature same-size transformation: the feature maps of different layers have inconsistent sizes, so whatever the fusion approach, they must be reshaped to the same size; going from a small size to a large one requires upsampling, and from a large size to a small one requires downsampling. Adaptive fusion: taking ASFF-3 as an example, the new fused feature ASFF-3 is obtained by multiplying the features X1, X2, X3 from level1, level2, level3 by the weight parameters α3, β3, and γ3 respectively and adding them together:
y_l^{ij} = α_l^{ij} · x_{1→l}^{ij} + β_l^{ij} · x_{2→l}^{ij} + γ_l^{ij} · x_{3→l}^{ij}  (2)
where y_l^{ij} denotes the (i, j) vector across channels of the output feature map y^l, and α_l^{ij}, β_l^{ij}, γ_l^{ij} denote the spatial importance weights of the feature maps from the three different levels with respect to level l. Since the fusion is performed by addition, the feature maps output by level1–level3 must have the same size and the same number of channels when added, which requires up- or down-sampling of the features from different layers and adjustment of the channel counts. The weight parameters α, β, and γ are obtained by 1×1 convolution of the resized features of level1–level3; after a concat layer and a softmax function, α, β, and γ all lie in [0, 1] and sum to 1:
α_l^{ij} + β_l^{ij} + γ_l^{ij} = 1,  α_l^{ij}, β_l^{ij}, γ_l^{ij} ∈ [0, 1]  (3)
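The adaptive fusion step can be sketched as follows — a NumPy toy example in which the 1×1 convolutions that produce the weight logits are replaced by random values for illustration:

```python
# A sketch of ASFF adaptive fusion: per-pixel weights for the three resized
# level features come from a softmax (so they lie in [0, 1] and sum to 1),
# then feed a weighted sum.
import numpy as np

def asff_fuse(x1, x2, x3, logits):
    """Fuse three same-sized feature maps with softmax weights alpha, beta, gamma."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable softmax
    w = e / e.sum(axis=0, keepdims=True)                    # shape (3, H, W), sums to 1
    return w[0] * x1 + w[1] * x2 + w[2] * x3, w

rng = np.random.default_rng(0)
h, w = 8, 8
x1, x2, x3 = (rng.standard_normal((h, w)) for _ in range(3))  # stand-in resized level features
fused, weights = asff_fuse(x1, x2, x3, rng.standard_normal((3, h, w)))
```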
The CIoU loss function of the original model is replaced by a loss designed from the NWD metric:
L_NWD = 1 − NWD(N_p, N_g), with NWD(N_p, N_g) = exp(−√(W₂²(N_p, N_g)) / C), where N_p is the Gaussian distribution model of the prediction box P, N_g is the Gaussian distribution model of the ground-truth box G, W₂² is the squared 2-Wasserstein distance between the two Gaussians, and C is a constant. The NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G.
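The NWD-based loss can be sketched as follows — assuming the standard normalized-Wasserstein formulation, in which each box (cx, cy, w, h) is modeled as a 2-D Gaussian and the squared 2-Wasserstein distance between the two Gaussians has a closed form; the constant C = 12.8 is an illustrative value, not taken from the patent:

```python
# A sketch of the NWD-based regression loss: 1 - exp(-sqrt(W2^2) / C).
import math

def wasserstein2(box_p, box_g):
    """Closed-form squared 2-Wasserstein distance between box Gaussians
    N(mu=(cx, cy), Sigma=diag(w^2/4, h^2/4))."""
    cxp, cyp, wp, hp = box_p
    cxg, cyg, wg, hg = box_g
    return ((cxp - cxg) ** 2 + (cyp - cyg) ** 2
            + (wp / 2 - wg / 2) ** 2 + (hp / 2 - hg / 2) ** 2)

def nwd_loss(box_p, box_g, c=12.8):
    """NWD loss: zero for identical boxes, approaches 1 for distant boxes."""
    return 1.0 - math.exp(-math.sqrt(wasserstein2(box_p, box_g)) / c)

# Identical boxes give zero loss; a fully disjoint pair still yields a
# finite, informative value (unlike IoU-style losses when |P ∩ G| = 0).
same = nwd_loss((10, 10, 4, 4), (10, 10, 4, 4))
far = nwd_loss((10, 10, 4, 4), (50, 50, 4, 4))
```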
Step 4, taking the image data in the dataset as input and training the improved YOLOv7 model to obtain a trained student classroom behavior detection model.
The image data in the student classroom behavior dataset are fed into the improved YOLOv7 model for training. The training parameters are set with a learning rate of 0.001 and a confidence threshold of 0.5; all pictures in the training set are input into the improved YOLOv7 model, and training is repeated to obtain the model with the best performance.
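The training setup described above can be summarized as a small configuration sketch; values not stated in the patent (epochs, batch size, image size) are marked as assumptions:

```python
# Training hyperparameters for the improved YOLOv7 model.
train_config = {
    "lr": 0.001,         # learning rate (stated in the patent)
    "conf_thres": 0.5,   # confidence threshold (stated in the patent)
    "train_ratio": 0.8,  # 8:2 train/test split (stated in the patent)
    "epochs": 100,       # assumed, not stated in the patent
    "batch_size": 16,    # assumed, not stated in the patent
    "img_size": 640,     # assumed, not stated in the patent
}
```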
Step 5, feeding the classroom scene images to be detected into the trained model to obtain the behavior categories and confidences of the students.
Students' classroom behavior is detected using the trained detection network based on improved YOLOv7.
Fig. 2 shows the network model structure of the student classroom behavior detection method based on improved YOLOv7. As shown in the figure, the method designs a new network structure for student classroom behavior detection: on the basis of the YOLOv7 network, an ACmix attention convolution module is added, the prediction head is changed to ASFFDetect, and the original loss function is replaced by an NWD-based regression loss. Experimental results show that the method has advantages in accuracy and real-time performance over the prior art.
Taking a group of student classroom behavior images as input, different input images are detected through step 5 to obtain student classroom behavior detection images. Fig. 3 shows the detection results for this group of pictures; it can be seen from Fig. 3 that the method accurately detects students' behaviors in a multi-target, occluded classroom scene, demonstrating its feasibility and effectiveness.
Claims (2)
1. A student classroom behavior detection method based on improved YOLOv7, characterized by comprising the following steps:
step 1, acquiring a video of student classroom behavior, and frame-removing the acquired video to obtain a picture of the student classroom behavior;
step 2, preprocessing the image obtained in the step 1, marking a student class behavior data set by using a labelImg image marking tool, and dividing the data set to obtain the student class behavior data set;
step 3, constructing a student classroom behavior detection network based on the improved YOLOv7: adding an ACmix attention convolution module to the YOLOv7 backbone network; improving the prediction head part of the YOLOv7 algorithm by replacing the Detect structure in the original YOLOv7 algorithm with an ASFFDetect structure, which learns the features of different levels through an optimal fusion method during training and filters out features from other levels that carry contradictory information; and simultaneously introducing an NWD-based regression loss as the loss function;
the student classroom behavior detection network based on the improved YOLOv7 mainly comprises Input, Backbone, Neck and Head parts; the ACmix attention convolution module introduced in the Neck part operates as follows:
the first stage: the input features are projected through three 1×1 convolutions and then reshaped into N blocks, yielding a feature map comprising 3×N intermediate features;
the second stage: the intermediate features are used according to different paradigms; for the self-attention path, the intermediate features are gathered into N groups, each group containing three features corresponding to the query q, key k and value v; for the convolution path with kernel size K, K² feature maps are generated by a lightweight fully-connected layer, and the features are produced through shift and aggregation operations;
the third stage: the outputs of the two paths are added, with the strength of each controlled by two learnable scalars:
F_out = α·F_att + β·F_conv        (1)
wherein F_out represents the final output, F_att represents the output of the self-attention branch, and F_conv represents the output of the convolution branch; the parameters α and β both take the value 1;
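The two-path combination of formula (1) can be sketched numerically as follows. This is a minimal NumPy illustration, not the patented implementation; the feature-map shapes and values are hypothetical:

```python
import numpy as np

def acmix_combine(f_att, f_conv, alpha=1.0, beta=1.0):
    """Combine the self-attention and convolution branch outputs:
    F_out = alpha * F_att + beta * F_conv."""
    return alpha * f_att + beta * f_conv

# Hypothetical branch outputs with shape (channels, height, width)
f_att = np.full((8, 4, 4), 0.5)
f_conv = np.full((8, 4, 4), 0.25)
f_out = acmix_combine(f_att, f_conv)  # alpha = beta = 1, as in the claim
```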
in the Head part, the Detect prediction head in the original network is replaced by an ASFFDetect prediction head; the ASFFDetect module comprises two steps: same-size transformation and adaptive feature fusion;
(1) Feature same-size transformation: the feature map sizes of different levels are inconsistent, so the feature maps are reshaped to the same size; up-sampling is needed when converting from a small size to a large size, and down-sampling when converting from a large size to a small size;
(2) Adaptive fusion: the features X1, X2 and X3 from levels 1, 2 and 3 are respectively multiplied by the weight parameters α, β and γ and summed to obtain the new fused feature ASFF-3:

y_ij^l = α_ij^l · x_ij^(1→l) + β_ij^l · x_ij^(2→l) + γ_ij^l · x_ij^(3→l)        (2)

wherein y_ij^l denotes the (i, j) vector of the output feature map y^l across channels, and α_ij^l, β_ij^l and γ_ij^l denote the spatial importance weights of the feature maps from the three different levels to level l; because an addition mode is adopted, the features of different levels need to be up-sampled or down-sampled and their channel numbers adjusted, so that the output features of levels 1-3 have the same size and the same number of channels;
the weight parameters α, β and γ are obtained by applying a 1×1 convolution to the resized feature maps of level1 to level3; after passing through the concat layer, the weight parameters α, β and γ are constrained to lie within [0, 1] and to sum to 1 by a softmax function:

α_ij^l = e^(λ_α,ij^l) / (e^(λ_α,ij^l) + e^(λ_β,ij^l) + e^(λ_γ,ij^l))        (3)

wherein λ_α,ij^l, λ_β,ij^l and λ_γ,ij^l are the control parameters produced by the 1×1 convolution; to replace the loss function of the original model, a loss function based on the NWD measurement is designed:

L_NWD = 1 − NWD(N_p, N_g)        (4)
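The adaptive fusion step can be sketched with NumPy as follows, assuming the three level features have already been resized to a common shape. The shapes and weight logits here are illustrative, not values from the patented network:

```python
import numpy as np

def asff_fuse(x1, x2, x3, logits):
    """Fuse three same-size level features with softmax-normalized
    spatial weights alpha, beta, gamma (summing to 1 at each position)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # numerically stable softmax over the 3 levels
    alpha, beta, gamma = e / e.sum(axis=0, keepdims=True)
    return alpha * x1 + beta * x2 + gamma * x3

h = w = 4
x1, x2, x3 = (np.full((h, w), v) for v in (1.0, 2.0, 3.0))
logits = np.zeros((3, h, w))           # equal logits -> weights of 1/3 each
fused = asff_fuse(x1, x2, x3, logits)  # every position becomes (1 + 2 + 3) / 3 = 2.0
```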
wherein N_p is the Gaussian distribution model of the prediction box P, and N_g is the Gaussian distribution model of the GT box G; the NWD-based loss can still provide gradients when |P ∩ G| = 0 and when |P ∩ G| = P or G;
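The NWD measurement can be illustrated as follows. This sketch follows the common formulation in which a box (cx, cy, w, h) is modeled as a 2-D Gaussian N((cx, cy), diag(w²/4, h²/4)); the normalizing constant C is dataset-dependent, and the value used here is only an assumption:

```python
import math

def nwd_loss(box_p, box_g, C=12.8):
    """NWD-based regression loss between a predicted box and a GT box,
    each given as (cx, cy, w, h) and modeled as a 2-D Gaussian.
    For axis-aligned Gaussians, the 2-Wasserstein distance reduces to the
    Euclidean distance between the (cx, cy, w/2, h/2) vectors."""
    p = (box_p[0], box_p[1], box_p[2] / 2, box_p[3] / 2)
    g = (box_g[0], box_g[1], box_g[2] / 2, box_g[3] / 2)
    w2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
    nwd = math.exp(-w2 / C)   # normalized Wasserstein distance in (0, 1]
    return 1.0 - nwd          # non-zero loss even when the boxes do not overlap

loss_same = nwd_loss((10, 10, 4, 4), (10, 10, 4, 4))  # identical boxes -> loss 0
loss_far = nwd_loss((10, 10, 4, 4), (40, 40, 4, 4))   # disjoint boxes still yield a finite, smooth loss
```

Unlike IoU-based losses, the distance remains informative for non-overlapping boxes, which is why the claim notes that gradients exist even when |P ∩ G| = 0.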
step 4, taking the image data in the data set as input and inputting it into the improved YOLOv7 model for training to obtain the trained student classroom behavior detection model;
and step 5, sending the student classroom scene image to be detected into the trained model to obtain the behavior category and confidence of the student.
2. The student classroom behavior detection method based on the improved YOLOv7 according to claim 1, wherein the image preprocessing and image labeling of step 2 comprise the following steps:
step 2.1, preprocessing the obtained student classroom behavior images using the OpenCV library: changing brightness and contrast, removing the background, smoothing local regions of the images, reducing noise, and fusing pictures to obtain the student classroom behavior images;
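The brightness/contrast adjustment of step 2.1 can be sketched without OpenCV as a linear pixel transform clipped to the valid range. The gain and bias values below are illustrative assumptions; the patent itself uses the OpenCV library:

```python
import numpy as np

def adjust_brightness_contrast(img, gain=1.2, bias=10):
    """Linear brightness/contrast transform out = gain * img + bias,
    clipped to the valid uint8 range [0, 255]."""
    return np.clip(gain * img.astype(np.float32) + bias, 0, 255).astype(np.uint8)

img = np.full((2, 2, 3), 200, dtype=np.uint8)  # a tiny hypothetical image
out = adjust_brightness_contrast(img)          # 1.2 * 200 + 10 = 250 per channel
```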
step 2.2, labeling the actions of the students in the obtained student classroom behavior images using the labelImg image labeling tool, and storing the label information in a txt file with the same name as the picture to obtain the student classroom behavior data set;
and step 2.3, dividing the student classroom behavior image data set into a training data set and a test data set: all pictures and their labels are divided into the training set and the test set in a ratio of 8:2.
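The 8:2 split of step 2.3 can be sketched as follows. The sample names are placeholders; a real pipeline would move each image together with its same-named txt label file:

```python
import random

def split_dataset(names, train_ratio=0.8, seed=0):
    """Shuffle sample names and split them into train/test lists at the given ratio."""
    rng = random.Random(seed)     # fixed seed so the split is reproducible
    shuffled = names[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

names = [f"img_{i:03d}" for i in range(100)]  # hypothetical image stems
train, test = split_dataset(names)            # 80 train / 20 test
```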
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310884525.5A CN117058752A (en) | 2023-07-19 | 2023-07-19 | Student classroom behavior detection method based on improved YOLOv7 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117058752A true CN117058752A (en) | 2023-11-14 |
Family
ID=88661621
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117611998A (en) * | 2023-11-22 | 2024-02-27 | 盐城工学院 | Optical remote sensing image target detection method based on improved YOLOv7 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||