CN115240259A - Face detection method and face detection system based on YOLO deep network in classroom environment - Google Patents

Face detection method and face detection system based on YOLO deep network in classroom environment

Info

Publication number
CN115240259A
Authority
CN
China
Prior art keywords
module
network
face detection
face
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210894051.8A
Other languages
Chinese (zh)
Inventor
王蓉芳
李智远
朱孟达
慕彩红
郝红侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210894051.8A priority Critical patent/CN115240259A/en
Publication of CN115240259A publication Critical patent/CN115240259A/en
Pending legal-status Critical Current



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face detection method and a face detection system based on a YOLO deep network in a classroom environment, improved on the basis of the original YOLOX algorithm. A smaller pooling kernel is used in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. A hybrid attention module is added to the network so that the model learns to suppress useless background information, improving detection precision. An adaptive spatial feature fusion operation is added to the network to resolve the inconsistency problem in the PAFPN structure. An EIOU loss function replaces the IOU loss function, directly minimizing the width and height differences between the real box and the predicted box and accelerating convergence. A transfer-learning pre-training operation addresses the problem of insufficient data and improves the face detection precision of the model in a classroom environment. A dividing module divides the face detection data set acquired in the classroom environment into a training set, a verification set and a test set.

Description

Face detection method and face detection system based on YOLO deep network in classroom environment
Technical Field
The invention belongs to the technical field of deep learning detection, and particularly relates to a face detection method and a face detection system based on a YOLO deep network in a classroom environment.
Background
The classroom is one of the application scenarios of face detection technology. In a traditional teaching environment, a teacher can only take attendance by roll call or sign-in, which is very time-consuming when there are many students. Introducing face detection technology into the classroom enables real-time detection and analysis of classroom attendance, concentration and other conditions of students, helping the teacher understand the attendance, in-class state and learning situation of each student, and then adjust teaching methods and strategies accordingly to improve teaching quality.
In a face detection task in a classroom environment, the following difficulties exist:
1. Faces in the classroom environment are generally small, and the distances between faces are small, making them difficult to distinguish;
2. The postures of students in the classroom environment are unpredictable, with severe occlusions, atypical postures and blurred faces;
3. The background occupies a large proportion of the image in the classroom environment, which interferes with face detection;
4. The scale of faces varies greatly between the front and the back of the classroom, placing high demands on the detection algorithm;
5. Few face detection data sets are available for the classroom environment and producing such data sets is costly, which is insufficient to support the training of large, complex networks.
the prior art includes a face detection method based on traditional manual features and a face detection method based on deep learning.
Before deep learning methods were introduced into the field of face detection, face detection was mainly based on the classical approach: manual features are extracted from an image (or a sliding window over an image), and the features are then fed into a classifier (or a set of classifiers) to detect possible face regions. The performance of these detectors depends to a large extent on the computational efficiency and expressive power of the features. With continuous improvement and exploration by researchers, face detection methods based on traditional manual features have achieved good detection results. However, features designed manually from experience have great limitations and are easily disturbed by environmental factors (such as blur, occlusion and brightness), so the application scenarios of such methods are limited and their robustness in complex environments is poor. Moreover, traditional face detection algorithms cannot automatically extract features useful for the detection task from the original image without human intervention, and performance limitations prevent traditional methods from processing large amounts of data.
With the breakthrough work of deep neural networks on image classification in 2012, face detection has also undergone a major shift. Inspired by the rapid development of deep learning in computer vision, many deep learning based frameworks have been applied to face detection in the past few years, remarkably improving detection accuracy. Because of their high detection efficiency and strong stability, deep learning based face detection models have become the mainstream framework for the face detection task. With continued research into new techniques and new networks, more and more excellent deep learning based face detection networks have been proposed, such as YOLOX, YOLO-Face and YOLO5Face, and these algorithms have obtained state-of-the-art results on various face detection benchmarks. However, past face detection algorithms often pursued detection precision alone while neglecting model size and detection speed; many algorithms trade larger network models and slower detection for gains in precision, making them hard to train, demanding on hardware, computationally expensive and slow, and therefore difficult to apply in practice.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a face detection method and a face detection system based on a YOLO deep network in a classroom environment. A smaller pooling kernel is used in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. A hybrid attention module is added to the network so that the model learns to suppress useless background information, improving detection precision. An adaptive spatial feature fusion operation is added to the network to resolve the inconsistency problem in the PAFPN structure. An EIOU loss function replaces the IOU loss function, directly minimizing the width and height differences between the real box and the predicted box and accelerating convergence. A transfer-learning pre-training operation addresses the problem of insufficient data and improves the face detection precision of the model in a classroom environment.
To achieve this purpose, the technical solution adopted by the invention is as follows:
a face detection method in a classroom environment based on a YOLO deep network comprises the following steps:
S1, dividing a face detection data set in a classroom environment into a training set, a verification set and a test set;
S2, reading the images in the training set and the verification set divided in step S1, converting the images into RGB format and adjusting their size, and then performing data enhancement on the training set divided in step S1;
S3, constructing a face detection convolutional neural network based on the YOLOX deep network in a classroom environment, named YOLOXs-face;
S4, constructing the loss function of the method using the EIOU loss function and the cross entropy loss function;
S5, training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
S6, continuing to train the YOLOXs-face network on the basis of the pre-training model obtained in step S5 using the training set processed in step S2, verifying with the verification set processed in step S2, and saving the network model that performs best on the verification set;
S7, testing the network model saved in step S6 with the test set divided in step S1 to obtain the face detection result in the classroom environment;
S8, quantitatively evaluating the detection performance of the network model using the F1 coefficient and the average precision of the detection result obtained in step S7.
Specifically, in step S1, samples in a face detection data set in a classroom environment are randomly divided into a training set, a verification set and a test set in a ratio of 11:4:5.
Specifically, step S2 is:
S201, preprocessing the images in the verification set divided in step S1: first converting the images into RGB format, then scaling the images of the verification set and the test set proportionally by bilinear interpolation, and finally unifying the image sizes by adding gray bars to the images;
S202, preprocessing the images in the training set divided in step S1: first converting the images into RGB format, then scaling the images proportionally and randomly changing their aspect ratio, unifying the image sizes by adding gray bars to the images, horizontally flipping the images with a given probability, and finally randomly changing the hue, saturation and brightness of the images to realize data enhancement;
S203, adjusting the real boxes of the verification set preprocessed in step S201 and of the training set preprocessed in step S202, respectively.
Specifically, step S3 is:
S301, constructing a face detection network in a classroom environment based on the YOLO deep network, named YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module;
S302, constructing a CBS module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
S303, constructing a residual module comprising convolution layers, batch normalization layers and SiLU nonlinear activation layers;
S304, constructing a Focus module based on the CBS module of step S302: first slicing the input image, expanding it from three channels to twelve channels, and then performing a convolution operation on the feature layer with one CBS module;
S305, constructing an SPP module based on the CBS module of step S302, the module consisting of CBS modules and maximum pooling operations;
S306, constructing a CSP_N module and a CSP2_N module based on the CBS module of step S302 and the residual module of step S303. The CSP_N module comprises a trunk branch and a residual branch: the trunk branch comprises a CBS module and N residual modules, and the residual branch comprises a CBS module; data are input into the two branches to obtain feature layers of the same size, which are stacked and then passed through a CBS module to obtain the output. The CSP2_N module has the same two-branch structure, except that its N residual modules have the residual edges removed;
S307, constructing the feature extraction module, a CSPDarkNet network, of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the Focus module of step S304, the SPP module of step S305 and the CSP_N and CSP2_N modules of step S306; this structure performs feature extraction on the input data. The data in the training set after the data enhancement of step S2 are input into the CSPDarkNet network, and three effective feature layers are obtained from the intermediate and bottom stages of the CSPDarkNet structure;
S308, constructing the feature enhancement module Attention network of the face detection network YOLOXs-face of step S301, the network consisting of three CBAM attention modules; the three effective feature layers obtained in step S307 are input into the three CBAM attention modules respectively to obtain three hybrid attention feature layers;
S309, constructing the feature enhancement module PAFPN network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302 and the CSP2_N module of step S306, the network consisting of an FPN and a PAN; the three hybrid attention feature layers obtained in step S308 are input into the PAFPN network, features are first transferred and fused in the FPN by up-sampling, and three enhanced feature layers are then obtained in the PAN by down-sampling fusion;
S310, constructing the feature enhancement module ASFF network of the face detection network YOLOXs-face of step S301, the network consisting of three adaptive spatial feature fusion modules; the three enhanced feature layers obtained in step S309 are input into the ASFF network, and different feature layers are adaptively fused to obtain three fused feature layers;
S311, constructing the feature point prediction Yolo Head network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the network consisting of three Yolo Head modules; the three fused feature layers obtained in step S310 are input into the Yolo Head network, and classification and regression operations are performed on the feature layers to obtain three prediction results of different scales;
S312, integrating the prediction results obtained in step S311 to obtain the final face detection result in the classroom environment.
Further, in step S301, the feature extraction module consists of a CSPDarkNet network, the feature enhancement module consists of an Attention network, a PAFPN network and an ASFF network, and the feature point prediction module consists of a Yolo Head network.
Further, the SPP module constructed in step S305 consists of two CBS modules and three maximum pooling operations with pooling kernel sizes of 7 × 7, 5 × 5 and 3 × 3, respectively.
further, in step S307, a Focus module, a CBS module, a CSP _1 module, a CBS module, a CSP _3 module, a CBS module, an SPP module, and a CSP2_1 module are sequentially arranged in the CSPDarkNet network, and outputs of the two CSP _3 modules and the CSP2_1 module serve as effective feature layers.
Further, in step S311, the Yolo Head module first adjusts the number of channels of the input feature layer with a convolution operation and then feeds the adjusted feature layer into a classification branch and a regression branch; the classification branch uses two CBS modules to extract features and then a convolution operation to predict categories, while the regression branch uses two CBS modules to extract features and then two 1 × 1 convolution operations to obtain the confidence and the regression parameters, respectively.
Further, in step S4, the loss function L of the algorithm is:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross entropy loss functions, representing the confidence loss and the class prediction loss, respectively.
Further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ²(b, b^gt)/C² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²
wherein IOU represents the intersection over union of the prediction box and the real box, b and b^gt represent the center points of the prediction box and the real box respectively, ρ represents the Euclidean distance between two points, w, h and w^gt, h^gt are the widths and heights of the prediction box and the real box, C represents the diagonal length of the minimum enclosing box of the prediction box and the real box, and C_w and C_h are the width and height of that minimum enclosing box, respectively.
Further, the cross entropy loss function L_BWL used to compute L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1 - y)·log(1 - σ(p)))
wherein y is the label, p is the predicted value, and σ denotes the sigmoid function.
Further, in steps S5 and S6, when the face detection network YOLOXs-face is trained, an Adam optimizer is used for optimization.
A detection system for the face detection method in a classroom environment based on the YOLOX deep network comprises the following modules:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
a pre-training module: training the YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using the verification set processed by the preprocessing module while training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
The pre-training dataset is the WIDER FACE dataset.
Compared with the prior art, the invention has the following advantages:
1) A spatial pyramid pooling structure with smaller kernels is used. The network structure of the spatial pyramid pooling structure is shown in fig. 6: maximum pooling is performed on the input feature layer with pooling kernels of different sizes, so that spatial feature information at different scales is extracted, which improves the detection accuracy and robustness of the model. Compared with convolutional neural network models containing fully connected layers, which can only process pictures of a fixed input size, the spatial pyramid pooling structure does not restrict the input picture size, making the network more flexible to use. In the invention, the pooling kernel sizes are changed to 7 × 7, 5 × 5 and 3 × 3 in sequence; using smaller pooling kernels in the spatial pyramid pooling structure of the network helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance.
2) A hybrid attention mechanism fusion operation is added. The attention mechanism in computer vision is modeled on the way humans attend. When processing visual information, humans pay different degrees of attention to the received information, focusing on information useful for prediction and automatically ignoring irrelevant content. In computer vision, a mask is generally used to form an attention mechanism: by assigning different weights to each input position, the model attends to important information and ignores irrelevant content. By adding an attention mechanism to the network, the invention enables the model to learn to suppress useless background information and improves detection precision. A flow chart of the attention module used in the invention is shown in fig. 7, with details shown in fig. 8.
3) An adaptive spatial feature fusion operation is added. In YOLOX-s, the PAFPN network performs feature fusion on three effective feature layers; high-level semantic information is then used to detect large targets, and low-level semantic information to detect small targets. In a classroom environment, the scales of faces at the front and back of the classroom usually differ greatly, i.e. large-scale and small-scale faces coexist in the same picture. In this case, conflicts between features at different levels often dominate the PAFPN; this inconsistency interferes with gradient computation during training and reduces the effectiveness of the feature pyramid. The invention resolves the inconsistency problem in the PAFPN structure by adding an adaptive spatial feature fusion module behind the PAFPN structure. Fig. 9 shows the structure of the adaptive spatial feature fusion module, taking adaptive spatial feature fusion module-3 as an example; the adaptive feature fusion is simple to implement and adds very little computation to the model.
4) The loss function is improved using EIOU. Because of the limitations of the IOU loss used by the YOLOX-s network, the invention uses the EIOU loss instead of the IOU loss when training the model. The EIOU loss function consists of three parts: the IOU loss, the center distance loss and the width-height loss; the width-height loss directly minimizes the width and height differences between the real box and the predicted box, which accelerates convergence.
5) A transfer-learning pre-training operation is used. For the face detection task in a classroom environment, too few open data sets are available for model training, and producing data sets is costly. To solve the problem of insufficient data, the invention introduces a transfer-learning pre-training operation: the network is first trained with the WIDER FACE data set to obtain a general face detection model, and the classroom-environment face detection data set is then trained on the basis of this general model to obtain a face detection model for the classroom environment. Compared with randomly initialized network parameters, the transfer-learning pre-training operation accelerates the convergence of the model during training and improves the face detection precision of the model in the classroom environment.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a general structure diagram of the YOLOXs-face according to the present invention.
FIG. 3 is a diagram of the CSPDarkNet network of the present invention.
Fig. 4 is a network structure diagram of PAFPN according to the present invention.
FIG. 5 is a block diagram of the Yolo Head module.
Fig. 6 is a network structure diagram of a spatial pyramid pooling structure.
Fig. 7 is an overall structural view of the hybrid attention module.
FIG. 8 is a detailed block diagram of the channel attention mechanism and the spatial attention mechanism in the hybrid attention module, wherein (a) is the channel attention module and (b) is the spatial attention module.
FIG. 9 shows the structure of the adaptive spatial feature fusion module, taking the adaptive spatial feature fusion module-3 as an example.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
The invention provides a face detection method and a face detection system in a classroom environment based on a YOLO deep network. First, the data set is divided into a training set, a verification set and a test set; data enhancement is then performed on the training set; next, a face detection convolutional neural network in a classroom environment is constructed based on the YOLOX deep network; the network model is trained with a pre-training data set to obtain a pre-training model; the network model is then trained on the basis of the pre-training model using the training set, and the model that performs best on the verification set is saved; finally, the test set is evaluated with the optimal model to obtain the various face detection index results in the classroom environment. A series of improvements are made on the basis of the original YOLOX target detection algorithm, including using a spatial pyramid pooling structure with smaller kernels, adding a hybrid attention module and an adaptive spatial feature fusion module, improving the loss function with EIOU, and using a transfer-learning pre-training operation, which effectively improves the accuracy of face detection in a classroom environment with few computing resources.
Referring to fig. 1, a face detection method in a classroom environment based on the YOLOX deep network includes the following steps:
s1, dividing a face detection data set in a classroom environment into a training set, a verification set and a test set, wherein the method specifically comprises the following steps:
S101, writing Python code to divide the SCUT-HEAD-PartA data set into a training set, a verification set and a test set. The SCUT-HEAD-PartA data set consists of 2000 pictures; 1100 pictures are taken at random as the training set, 400 as the verification set and 500 as the test set.
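By way of illustration, a minimal Python sketch of this 1100/400/500 random split follows; the function name, the list-of-paths interface and the fixed seed are assumptions for reproducibility, not taken from the patent.

```python
import random

def split_scut_head_parta(image_paths, seed=42):
    """Randomly split the 2000 SCUT-HEAD-PartA images into 1100/400/500."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle (seed is an assumption)
    return paths[:1100], paths[1100:1500], paths[1500:2000]
```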
S2, reading the images in the training set and the verification set divided in the step S1, converting the images into an RGB format, adjusting the size of the images, and then performing data enhancement on the training set divided in the step S1, wherein the specific steps are as follows:
S201, preprocessing the images in the verification set divided in step S1: first converting the images into RGB format, then scaling the images of the verification set and the test set proportionally by bilinear interpolation so that the long edge of each image is 640 pixels, and finally creating a gray picture of size 640 × 640 and placing the scaled image at its center;
S202, preprocessing the images in the training set divided in step S1: first converting the images into RGB format, then scaling the images proportionally and randomly changing their aspect ratio, then creating a gray picture of size 640 × 640 and placing the scaled image at its center, horizontally flipping the image with a given probability, and finally randomly changing the hue, saturation and brightness of the image to realize data enhancement;
S203, adjusting the real boxes of the verification set preprocessed in step S201 and of the training set preprocessed in step S202, respectively.
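A minimal sketch of the letterbox preprocessing of steps S201/S202 follows, covering only the deterministic validation/test path; PIL and a mid-gray fill of (128, 128, 128) for the gray bars are assumptions, since the exact fill value is not stated in the text.

```python
from PIL import Image

def letterbox(img, size=640, fill=(128, 128, 128)):
    """Convert to RGB, scale proportionally, and center on a gray canvas."""
    img = img.convert("RGB")
    w, h = img.size
    scale = size / max(w, h)                        # equal-ratio scaling
    nw, nh = int(w * scale), int(h * scale)
    resized = img.resize((nw, nh), Image.BILINEAR)  # bilinear interpolation
    canvas = Image.new("RGB", (size, size), fill)   # 640 x 640 gray picture
    canvas.paste(resized, ((size - nw) // 2, (size - nh) // 2))
    return canvas
```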
S3, with reference to FIG. 2, constructing a face detection convolutional neural network based on the YOLOX deep network in a classroom environment, named YOLOXs-face, specifically as follows:
S301, constructing a face detection network in a classroom environment based on the YOLO deep network, named YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module, wherein the feature extraction module consists of a CSPDarkNet network, the feature enhancement module consists of an Attention network, a PAFPN network and an ASFF network, and the feature point prediction module consists of a Yolo Head network;
S302, constructing a CBS module comprising 1 convolution layer, 1 batch normalization layer and 1 SiLU nonlinear activation layer;
the CBS module consists of 1 convolution layer, 1 batch normalization layer and 1 SiLU nonlinear activation layer; the convolution kernel size of the convolution layer changes with the usage scenario, and the batch normalization layer and the SiLU activation layer follow the convolution layer in sequence.
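A minimal PyTorch sketch of such a CBS block follows (the 'same' padding and bias-free convolution are conventional assumptions); the later sketches in this description reuse it.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> Batch normalization -> SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```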
S303, constructing a residual module comprising 2 convolution layers, 2 batch normalization layers and 2 SiLU nonlinear activation layers;
the residual module consists of 2 convolution layers, 2 batch normalization layers and 2 SiLU nonlinear activation layers; the convolution kernel sizes of the 2 convolution layers are, in sequence, 1 × 1 and 3 × 3, a batch normalization layer and a SiLU activation layer are added behind each convolution layer, and the output is connected with the input through a residual connection as the final output of the module.
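A sketch of this residual block, reusing the CBS class above; keeping the channel count constant across the two convolutions is an assumption.

```python
class Residual(nn.Module):
    """1x1 CBS followed by 3x3 CBS, with the input added back as a residual edge."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = CBS(channels, channels, k=1)
        self.conv2 = CBS(channels, channels, k=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```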
S304, constructing a Focus module based on the CBS module of step S302: first slicing the input image, expanding it from three channels to twelve channels, and then performing a convolution operation on the feature layer with one CBS module;
the Focus module takes a value at every other pixel of the picture, expanding the input from three channels to twelve channels, and then adjusts the number of channels of the feature layer with a CBS module whose convolution kernel size is 3 × 3.
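A sketch of the Focus slicing, reusing the CBS class above; the output channel count is an illustrative assumption.

```python
class Focus(nn.Module):
    """Sample every other pixel into four slices (3 -> 12 channels), then 3x3 CBS."""
    def __init__(self, c_in=3, c_out=64):
        super().__init__()
        self.conv = CBS(4 * c_in, c_out, k=3)

    def forward(self, x):
        slices = [x[..., ::2, ::2], x[..., 1::2, ::2],
                  x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(slices, dim=1))
```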
S305, constructing an SPP module based on the CBS module of step S302, the module comprising two CBS modules and three maximum pooling operations with pooling kernel sizes of 7 × 7, 5 × 5 and 3 × 3, respectively;
the SPP module first adjusts the number of channels of the input feature layer with a CBS module whose convolution kernel size is 1 × 1, then applies maximum pooling to the feature layer with three pooling kernels of sizes 7 × 7, 5 × 5 and 3 × 3, stacks the three pooled feature layers with the initial feature layer, and then adjusts the number of channels of the stacked feature layers with a CBS module whose convolution kernel size is 1 × 1 to obtain the final output.
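A sketch of this SPP variant with the smaller 7/5/3 kernels, reusing the CBS class; halving the channel count in the first CBS is a conventional assumption. Stride-1 pooling with padding k // 2 keeps all branches the same size so they can be stacked.

```python
class SPP(nn.Module):
    """1x1 CBS, parallel 7/5/3 max pooling, concat with input, then 1x1 CBS."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.conv1 = CBS(c_in, c_hid, k=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (7, 5, 3))
        self.conv2 = CBS(c_hid * 4, c_out, k=1)

    def forward(self, x):
        x = self.conv1(x)
        return self.conv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```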
S306, constructing a CSP_N module and a CSP2_N module based on the CBS module of step S302 and the residual module of step S303. The CSP_N module comprises a trunk branch and a residual branch: the trunk branch comprises a CBS module and N residual modules, and the residual branch comprises one CBS module; data are input into the two branches to obtain two feature layers of the same size, which are stacked and then passed through a CBS module to obtain the output. The CSP2_N module comprises a trunk branch and a residual branch: the trunk branch comprises 1 CBS module and N residual modules with the residual edges removed, and the residual branch comprises one CBS module; data are input into the two branches to obtain two feature layers of the same size, which are stacked and then passed through one CBS module to obtain the output;
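A sketch of the CSP_N branch structure, reusing the CBS and Residual classes above (the CSP2_N variant would swap Residual for a block without the residual edge); the half-width hidden channels are an assumption.

```python
class CSP_N(nn.Module):
    """Trunk (CBS + N residual modules) and shortcut (one CBS), stacked then fused."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.trunk = nn.Sequential(
            CBS(c_in, c_hid, k=1), *[Residual(c_hid) for _ in range(n)])
        self.shortcut = CBS(c_in, c_hid, k=1)
        self.fuse = CBS(2 * c_hid, c_out, k=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.trunk(x), self.shortcut(x)], dim=1))
```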
S307, constructing the feature extraction module, a CSPDarkNet network, of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the Focus module of step S304, the SPP module of step S305 and the CSP_N and CSP2_N modules of step S306; this structure performs feature extraction on the input data. The data in the training set after the data enhancement of step S2 are input into the CSPDarkNet network, and three effective feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512 are obtained from the intermediate and bottom stages of the CSPDarkNet structure;
the CSPDarkNet network comprises, in sequence, a Focus module, a CBS module, a CSP_1 module, a CBS module, a CSP_3 module, a CBS module, a CSP_3 module, a CBS module, an SPP module and a CSP2_1 module; the outputs of the two CSP_3 modules and of the CSP2_1 module serve as the effective feature layers, and the convolution kernel size of the CBS modules is 3 × 3.
S308, constructing the feature enhancement module Attention network of the face detection network YOLOXs-face of step S301, the network consisting of three CBAM attention modules; the three effective feature layers obtained in step S307 are input into the three CBAM attention modules respectively, yielding three hybrid attention feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512;
channel attention module: global maximum pooling and global average pooling are first applied to the input feature layer, the pooled feature layers are processed by a shared fully connected layer, the two results are added, and a Sigmoid activation function then yields a weight for each channel of the input feature layer; finally, the weights are multiplied with the input feature layer to obtain the output;
spatial attention module: the maximum and the average are first taken over the channels at each feature point of the input feature layer, the two results are stacked, a convolution with one output channel adjusts the number of channels, and a Sigmoid activation function then yields a weight for each spatial position of the input feature layer; finally, the weights are multiplied with the input feature layer to obtain the output.
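A sketch of these two attention modules in PyTorch; the reduction ratio of 16 and the 7 × 7 spatial convolution are conventional CBAM choices assumed here, not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Global max/avg pooling -> shared MLP -> sigmoid channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        w = torch.sigmoid(self.shared(F.adaptive_avg_pool2d(x, 1))
                          + self.shared(F.adaptive_max_pool2d(x, 1)))
        return x * w

class SpatialAttention(nn.Module):
    """Channel-wise mean/max maps -> single-channel conv -> sigmoid position weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```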
S309, constructing the feature enhancement module PAFPN network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302 and the CSP2_N module of step S306, the network consisting of an FPN and a PAN, with the specific structure shown in FIG. 4; the three hybrid attention feature layers obtained in step S308 are input into the PAFPN network, features are first transferred and fused in the FPN by up-sampling, and three enhanced feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512 are then obtained in the PAN by down-sampling fusion;
S310, constructing the feature enhancement module ASFF network of the face detection network YOLOXs-face of step S301, the network consisting of three adaptive spatial feature fusion modules; the three enhanced feature layers obtained in step S309 are input into the ASFF network, and different feature layers are adaptively fused, yielding three fused feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512;
the adaptive feature fusion module lets the network learn directly how to spatially filter the features of other layers, so that only useful information is retained for combination. For a given feature layer, the adaptive feature fusion module first resizes the other feature layers to the same size, then learns the optimal fusion: at each spatial position, the different feature layers are fused adaptively, so that features carrying contradictory information are filtered out and discriminative features are strengthened.
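A simplified sketch of per-position adaptive fusion for three feature maps already resized to a common resolution and channel count; generating the softmax weights from a single 1 × 1 convolution is an assumption made for brevity.

```python
class ASFFFuse(nn.Module):
    """Learn per-pixel softmax weights and fuse three same-size feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, f0, f1, f2):
        w = torch.softmax(self.weight(torch.cat([f0, f1, f2], dim=1)), dim=1)
        return f0 * w[:, 0:1] + f1 * w[:, 1:2] + f2 * w[:, 2:3]
```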
S311, constructing the feature point prediction Yolo Head network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the network consisting of three Yolo Head modules, whose structure is shown in FIG. 5; the three fused feature layers obtained in step S310 are input into the Yolo Head network, and classification and regression operations are performed on the feature layers to obtain three prediction results of sizes 80 × 80 × 6, 40 × 40 × 6 and 20 × 20 × 6;
the Yolo Head module first adjusts the number of channels of the input feature layer with a 1 × 1 convolution operation and feeds the adjusted feature layer into a classification branch and a regression branch; the classification branch uses two CBS modules to extract features and then a 1 × 1 convolution operation to predict categories, while the regression branch uses two CBS modules to extract features and then two 1 × 1 convolution operations to obtain the confidence and the regression parameters, respectively;
S312, integrating the prediction results obtained in step S311 to obtain a result of size 8400 × 6, where 8400 (= 80 × 80 + 40 × 40 + 20 × 20) is the number of prediction boxes finally produced by the network and 6 is the size of each prediction, comprising the regression coefficients (x, y, w, h) of the prediction box, the confidence that the prediction box contains an object, and the probability that the object in the prediction box is a face;
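A sketch of this integration step, assuming the three prediction maps come as (batch, 6, H, W) tensors.

```python
import torch

def flatten_predictions(p80, p40, p20):
    """Concatenate (B, 6, 80, 80), (B, 6, 40, 40), (B, 6, 20, 20) into (B, 8400, 6)."""
    flat = [p.flatten(2).permute(0, 2, 1) for p in (p80, p40, p20)]
    return torch.cat(flat, dim=1)  # 6400 + 1600 + 400 = 8400 prediction boxes
```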
S4, constructing the loss function L of the method using the EIOU loss function and the cross entropy loss function, specifically:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross entropy loss functions, representing the confidence loss and the class prediction loss, respectively;
further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ²(b, b^gt)/C² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²
where IOU represents the intersection over union of the prediction box and the real box, b and b^gt represent the center points of the prediction box and the real box respectively, ρ represents the Euclidean distance between two points, w, h and w^gt, h^gt are the widths and heights of the prediction box and the real box, C represents the diagonal length of the minimum enclosing box of the prediction box and the real box, and C_w and C_h represent the width and height of that minimum enclosing box, respectively;
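A sketch of this EIOU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4); the box format and the small eps for numerical stability are assumptions.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """Per-pair EIOU loss: 1 - IOU + center term + width term + height term."""
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)
    # minimum enclosing box: width C_w, height C_h, squared diagonal C^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between the box centers, rho^2(b, b_gt)
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    return (1 - iou + rho2 / c2
            + (pw - tw) ** 2 / (cw ** 2 + eps)
            + (ph - th) ** 2 / (ch ** 2 + eps))
```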
further, the cross entropy loss function L_BWL used to compute L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1 - y)·log(1 - σ(p)))
wherein y is the label, p is the predicted value, and σ denotes the sigmoid function.
S5, training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
S6, continuing to train the YOLOXs-face network on the basis of the pre-training model obtained in step S5 using the training set processed in step S2, verifying with the verification set processed in step S2, and saving the network model that performs best on the verification set. In steps S5 and S6, the face detection network YOLOXs-face is trained with a batch size of 24, optimized with an Adam optimizer, with an initial learning rate of 0.001 that is multiplied by 0.98 in each training epoch, for 400 training epochs in total.
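A minimal sketch of this training schedule (batch size 24 would be configured in the data loader); `model`, `train_loader` and `compute_loss` are assumed placeholders for the YOLOXs-face network, the enhanced training data and the loss of step S4.

```python
import torch

def train(model, train_loader, compute_loss, epochs=400):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # multiply the learning rate by 0.98 once per epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
    for _ in range(epochs):
        for images, targets in train_loader:
            loss = compute_loss(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```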
S7, testing the network model saved in the step S6 by using the test set divided in the step S1 to obtain a face detection result in a classroom environment;
S8, quantitatively evaluating the detection performance of the network model using the F1 coefficient and the average precision of the detection result obtained in step S7.
A detection system for the face detection method in a classroom environment based on the YOLOX deep network comprises the following modules:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
a pre-training module: training a YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using a verification set processed by the preprocessing module during training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
The pre-training dataset is the WIDER FACE dataset.
Simulation experiment
1. The experimental conditions are as follows:
table 1 experimental environment configuration of the present invention
2. Simulation content and result analysis:
the samples in the simulation experiment of the invention are from three parts: the first part is a SCUT-HEAD data set, the data set is a face detection data set aiming at a classroom environment and is divided into a part A and a part B, images of the part A come from classroom monitoring videos, and images of the part B come from the Internet, and the face detection data set is used for training a network to obtain a face detection model under the classroom environment and verifying and improving effectiveness; the second part is a WIDER FACE data set which is a popular human FACE detection reference data set and is used for realizing the transfer learning pre-training operation in the invention; the third part is picture data under a real classroom environment which is collected and labeled from a network, and the picture data is used for testing the generalization capability of the model in the invention.
The sizes of the images in the data used by the invention are not consistent, so the image size is unified to 640 × 640 in the data preprocessing stage.
The detection performance of the proposed YOLOXs-face network model is quantitatively evaluated with the F1 coefficient and the Average Precision (AP). The specific meaning of each index is as follows:
TP (True Positive): true positive, representing a correctly classified positive sample;
FN (False Negative): false negative examples, representing misclassified positive samples;
FP (False Positive): false positive case, representing a misclassified negative example;
TN (True Negative): the true negative, representing a negative example that is correctly classified.
In the face detection task, whether each prediction result is correct must first be judged in order to obtain these indexes. Unlike the classification problem, in the face detection task it is necessary to determine whether a detection result is correct by computing the Intersection over Union (IOU) of the prediction box and the real box. The IoU is calculated as:
IoU = area(A ∩ B) / area(A ∪ B)
wherein, A and B respectively represent a prediction frame and a real frame of a human face.
First, a picture is input into the model to obtain prediction boxes. For each prediction box, the IOU values between the prediction box and all real boxes of the picture are computed, and the maximum is taken as MaxIOU. A threshold (generally set to 0.5) is then applied: when MaxIOU is greater than the threshold, the prediction box is classified as a true positive TP; otherwise it is classified as a false positive FP.
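A sketch of this matching rule for boxes given as (x1, y1, x2, y2) tuples; purely illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_boxes, threshold=0.5):
    """MaxIOU rule: TP if the best IoU against any real box exceeds the threshold."""
    return max((iou(pred_box, g) for g in gt_boxes), default=0.0) > threshold
```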
Recall is defined with respect to the original samples and indicates how many positive examples in the sample are predicted correctly. There are two cases: the original positive class is predicted as positive (TP), or the original positive class is predicted as negative (FN), where TP + FN equals the number of real boxes:
Recall = TP / (TP + FN)
precision (Precision) is for the prediction result, indicating how many of the samples predicted to be positive are true positive samples. Then two possibilities are possible to predict positive class as positive class (TP) and negative class as positive class (FP):
Figure BDA0003768683840000221
the F1 score is an index for measuring the accuracy of the two classification models, and takes into account the accuracy and recall of the classification models.
F1 = 2 × Precision × Recall / (Precision + Recall)
Average Precision (AP) is a performance metric for algorithms that predict both target location and class:
AP = ∫₀¹ p(r) dr
wherein p represents Precision, r represents Recall, and p is a function with r as a parameter.
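A sketch of the recall, precision and F1 computations from TP/FP/FN counts (AP, being the area under the precision-recall curve p(r), is usually approximated by numerical integration over ranked detections and is omitted here).

```python
def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```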
The present invention uses ablation experiments to verify the effectiveness of the improvement.
TABLE 2 summary of ablation test results obtained from simulation experiments of the present invention
As the results in Table 2 show, each of the improvements of the invention over the YOLOX deep network is effective. EIOU denotes replacing the IOU loss function in YOLOX-s with the EIOU loss function, which improves the detection precision of the model by 1%; ASFF denotes adding the adaptive spatial feature fusion module to the network, which improves the detection precision by 0.1%; Attention denotes adding the CBAM attention mechanism module to the network, which improves the detection precision by 0.05%; SPP (3, 5, 7) denotes using smaller pooling kernels in the spatial pyramid pooling structure of the backbone network, which improves the detection performance by 0.07%; finally, Pretrained denotes the transfer-learning pre-training operation, which improves the detection performance of the network by 0.83%.
The invention compares the face detection results of different networks, where YOLO-face is a face detection network based on YOLOv3 and Tinaface is one of the face detectors with the most advanced detection performance at present.
TABLE 3 summary of model comparison results obtained from simulation experiments of the present invention
The results in Table 3 show that, compared with the other algorithms, the proposed YOLOXs-face method is well balanced and better suited to the real-time face detection task in a classroom environment. Compared with the YOLO-face algorithm, the proposed YOLOXs-face method achieves better face detection performance with fewer parameters, less computation and faster detection. Compared with the YOLOX-s algorithm, the proposed YOLOXs-face method greatly improves detection precision at the cost of only a small increase in model parameters, computation and detection time. Compared with the Tinaface algorithm, although the detection precision of the proposed YOLOXs-face method is lower, its model parameter count, computation and per-picture detection time are far smaller than those of Tinaface, so YOLOXs-face places lower demands on hardware equipment, which favors the popularization of the method; at the same time, the F1 score of YOLOXs-face is higher than that of Tinaface, so, considered comprehensively, YOLOXs-face performs better.
In summary, the invention provides a face detection method and a face detection system in a classroom environment based on a YOLO deep network, which make a series of improvements to the original YOLOX algorithm, including using a spatial pyramid pooling structure with smaller kernels, adding a hybrid attention module and an adaptive spatial feature fusion module, improving the loss function with EIOU, and using a transfer-learning pre-training operation, thereby improving the precision of face detection in the classroom environment with few computing resources.
The method has a better detection effect on faces in the classroom environment. First, it uses smaller pooling kernels in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. Second, the hybrid attention fusion operation and the adaptive spatial feature fusion operation reduce the influence of the environment on face detection and the interference between faces of different scales, reducing the probability of false detection.
The invention has low hardware requirements and good universality. Compared with the prior art, the model of the proposed method is smaller and runs well on devices with little memory.
The invention has low computational cost and short detection time. Compared with the prior art, the network requires less computation and runs better on lower-performance devices.
The model in the invention adopts a modular design: modules are added or modified to address the shortcomings of the base network in the face detection task in a classroom environment, and as new techniques develop and better network modules are proposed, the invention can be iteratively updated at any time to improve model performance.

Claims (10)

1. A face detection method in a classroom environment based on a YOLO deep network is characterized in that: the method specifically comprises the following steps:
the method includes the steps that S1, a face detection data set in a classroom environment is divided into a training set, a verification set and a test set;
s2, reading the images in the training set and the verification set divided in the step S1, converting the images into an RGB format, adjusting the size of the images, and then performing data enhancement on the training set divided in the step S1;
s3, constructing a face detection convolutional neural network based on a YOLOX deep network in a classroom environment, and naming the face detection convolutional neural network as a YOLOXs-face;
s4, constructing a loss function by using the EIOU loss function and the cross entropy loss function;
s5, training the YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
s6, continuously training the YOLOXs-face network on the basis of the pre-training model obtained in the step S5 by using the training set processed in the step S2, verifying by using the verification set processed in the step S2, and storing the optimal network model represented on the verification set;
s7, testing the network model saved in the step S6 by using the test set divided in the step S1 to obtain a face detection result in a classroom environment;
and S8, quantitatively evaluating the detection performance of the network model by using the F1 score and the average precision of the detection result obtained in the step S7.
2. The face detection method in the classroom environment based on the YOLO deep network as claimed in claim 1, wherein: in the step S1, samples in the face detection data set in the classroom environment are randomly divided into a training set, a verification set and a test set according to the ratio of 11.
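As an illustration of the S1 split in claim 2, the three-way random division can be written with torch.utils.data.random_split. The split ratio printed above appears truncated in this publication text, so the 8:1:1 proportions below are purely an assumption for the sketch, not the claimed ratio.

import torch
from torch.utils.data import random_split

# Minimal sketch of the S1 split; the 8:1:1 ratio is an assumption,
# since the ratio digits are garbled in this text.
def split_dataset(dataset, ratios=(0.8, 0.1, 0.1), seed=0):
    n = len(dataset)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    n_test = n - n_train - n_val                     # remainder goes to the test set
    generator = torch.Generator().manual_seed(seed)  # reproducible shuffle
    return random_split(dataset, [n_train, n_val, n_test], generator=generator)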
3. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: the specific method of the step S2 is as follows:
s201, preprocessing the images in the verification set divided in the step S1, firstly converting the images into an RGB format, then scaling the sizes of the images in the verification set and the test set in an equal ratio by using a bilinear interpolation method, and finally unifying the sizes of the images by adding gray bars to the images;
s202, preprocessing the images in the training set divided in the step S1, firstly converting the images into an RGB format, then scaling the images in an equal ratio, and then scaling the aspect ratio of the images randomly; unifying the size of the image by a method of adding gray bars to the image, horizontally turning the image according to the probability, and finally randomly changing the tone, saturation and brightness of the image to realize data enhancement;
and S203, adjusting the ground-truth boxes of the verification set preprocessed in the step S201 and the training set preprocessed in the step S202 accordingly.
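A minimal sketch of the S201/S202 resize path in claim 3: equal-ratio bilinear scaling followed by gray-bar (letterbox) padding to a fixed square input. The 640x640 target size and the gray value 128 are illustrative assumptions, not values stated in the patent; the returned scale and offsets are what S203 would use to adjust the ground-truth boxes.

from PIL import Image

# Letterbox sketch: equal-ratio bilinear scaling plus gray padding bars.
# Target size and gray value are assumptions.
def letterbox(image, target=(640, 640), gray=(128, 128, 128)):
    image = image.convert("RGB")                 # ensure RGB format (S201)
    w, h = image.size
    tw, th = target
    scale = min(tw / w, th / h)                  # equal-ratio scaling factor
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.BILINEAR)
    canvas = Image.new("RGB", target, gray)      # gray background
    dx, dy = (tw - nw) // 2, (th - nh) // 2
    canvas.paste(resized, (dx, dy))              # center the scaled image
    return canvas, scale, (dx, dy)               # scale/offsets feed S203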
4. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: the specific method of the step S3 comprises the following steps:
s301, constructing a face detection network based on a YOLO deep network in a classroom environment, and naming the face detection network as YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module;
s302, constructing a CBS module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
s303, constructing a residual error module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
s304, constructing a Focus module based on the CBS module in the step S302, firstly, slicing the input image, expanding the input image from three channels to twelve channels, and then performing convolution operation on the characteristic layer by using one CBS module;
s305, constructing an SPP module based on the CBS module of the step S302, wherein the module consists of the CBS module and a maximum pooling operation;
s306, constructing a CSP _ N module and a CSP2_ N module based on the CBS module in the step S302 and the residual error module in the step S303, wherein the CSP _ N module comprises a trunk branch and a residual error branch, the trunk branch comprises a CBS module and N residual error modules, the residual error branch comprises a CBS module, data are respectively input into the trunk branch and the residual error branch to obtain characteristic layers with the same size, and the characteristic layers are stacked and then output through the CBS module; the CSP2_ N module comprises a trunk branch and a residual branch, the trunk branch comprises a CBS module and N residual modules for removing residual edges, the residual branch comprises a CBS module, data are respectively input into the trunk branch and the residual branch to obtain characteristic layers with the same size, and the characteristic layers are stacked and then output through the CBS module;
s307, constructing a feature extraction module CSPDarkNet network of the face detection network YOLOXs-face in the step S301 based on the CBS module in the step S302, the Focus module in the step S304, the SPP module in the step S305, the CSP _ N module in the step S306 and the CSP2_ N module, wherein the feature extraction operation is carried out on input data by the structure; inputting the data in the training set subjected to data enhancement in the step S2 into a CSPDarkNet network, and obtaining three effective characteristic layers in the middle layer, the middle-lower layer and the bottom layer of the CSPDarkNet structure;
s308, constructing a feature enhancement module Attention network of the face detection network YOLOXs-face in the step S301, wherein the network consists of three CBAM Attention modules; inputting the three effective feature layers obtained in the step S307 into three CBAM attention modules respectively to obtain three mixed attention feature layers;
s309, constructing a feature enhancement module PAFPN network of the face detection network YOLOXs-face in the step S301 based on the CBS module in the step S302 and the CSP2_ N module in the step S306, wherein the network consists of the FPN and the PAN; inputting the three mixed attention feature layers obtained in the step S308 into a PAFPN network, firstly performing feature transfer fusion in the FPN network in an upsampling mode, and then obtaining three enhanced feature layers in a FAN network in a downsampling fusion mode;
s310, constructing a feature enhancement module (ASFF) network of the face detection network YOLOXs-face in the step S301, wherein the network consists of three self-adaptive spatial feature fusion modules; inputting the three enhanced feature layers obtained in the step S309 into an ASFF network, and adaptively fusing different feature layers to obtain three fused feature layers;
s311, constructing a feature point prediction Yolo Head network of the face detection network Yoloxs-face in the step S301 based on the CBS module in the step S302, wherein the network consists of Yolo Head modules; inputting the three fusion feature layers obtained in the step S310 into a Yolo Head network, and performing classification and regression operation on the feature layers to obtain three prediction results with different scales;
and S312, integrating the prediction results obtained in the step S311 to obtain a final face detection result in the classroom environment.
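As an illustration of steps S302 and S304, a CBS block (convolution, batch normalization, SiLU) and the Focus slicing that expands three input channels to twelve before a single CBS convolution might look as follows in PyTorch; the kernel sizes and output channel count are assumptions, not values fixed by the claim.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + BatchNorm + SiLU activation (S302).
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    # Slice the image into four interleaved sub-images (3 -> 12 channels),
    # then apply one CBS convolution (S304).
    def __init__(self, c_out=32):
        super().__init__()
        self.cbs = CBS(12, c_out)

    def forward(self, x):
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.cbs(torch.cat(patches, dim=1))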
5. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: in the step S301, the feature extraction module is composed of the CSPDarkNet network, the feature enhancement module is composed of the Attention network, the PAFPN network and the ASFF network, and the feature point prediction module is composed of the Yolo Head network; in the step S307, the CSPDarkNet network sequentially comprises a Focus module, a CBS module, a CSP_1 module, a CBS module, a CSP_3 module, a CBS module, a CSP_3 module, a CBS module, an SPP module and a CSP2_1 module, and the outputs of the two CSP_3 modules and the CSP2_1 module serve as the effective feature layers.
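The residual module of S303 and the two-branch CSP_N module of S306 referenced in claim 5 could be sketched as below; channel widths are assumptions, and CBS is redefined so the sketch stays self-contained. The CSP2_N variant would differ only in dropping the identity shortcut inside the trunk's residual blocks.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + BatchNorm + SiLU, redefined here for self-containment.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Residual(nn.Module):
    # Residual module (S303): two CBS layers with an identity shortcut.
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBS(c, c, k=1), CBS(c, c, k=3))

    def forward(self, x):
        return x + self.block(x)

class CSP_N(nn.Module):
    # CSP_N (S306): trunk branch (CBS + N residual modules) and a shortcut
    # branch (one CBS); outputs are stacked along the channel dimension
    # and fused by a final CBS.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.trunk = nn.Sequential(CBS(c_in, c_mid),
                                   *[Residual(c_mid) for _ in range(n)])
        self.shortcut = CBS(c_in, c_mid)
        self.fuse = CBS(2 * c_mid, c_out)

    def forward(self, x):
        return self.fuse(torch.cat([self.trunk(x), self.shortcut(x)], dim=1))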
6. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: the SPP module constructed in the step S305 consists of two CBS modules and three maximum pooling operations with pooling kernel sizes of 7×7, 5×5 and 3×3, respectively.
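A sketch of the claim-6 SPP module: one CBS in, three parallel stride-1 max poolings with 3×3, 5×5 and 7×7 kernels (the smaller kernels the description credits for small-face detection), channel stacking, and one CBS out. Channel counts are assumptions.

import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, c_in, c_out, kernels=(3, 5, 7)):
        super().__init__()
        c_mid = c_in // 2
        self.cbs_in = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        # stride-1 pooling with same padding keeps the spatial size unchanged
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels])
        self.cbs_out = nn.Sequential(
            nn.Conv2d(c_mid * (len(kernels) + 1), c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        x = self.cbs_in(x)
        return self.cbs_out(torch.cat([x] + [p(x) for p in self.pools], dim=1))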
7. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: in the step S311, the Yolo Head module first adjusts the number of channels of the input feature layer by a convolution operation, and then inputs the adjusted feature layer into the classification branch and the regression branch respectively; the classification branch first extracts features with two CBS modules and then predicts categories with a convolution operation, and the regression branch first extracts features with two CBS modules and then obtains the confidence and the regression parameters with two separate 1×1 convolution operations.
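The decoupled head of claim 7 could be sketched as follows: a 1×1 convolution adjusts channels, then a classification branch (two CBS plus one class-prediction convolution) and a regression branch (two CBS, then two separate 1×1 convolutions for the regression parameters and the confidence). The 128-channel width and the single face class are assumptions.

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU())

class YoloHead(nn.Module):
    def __init__(self, c_in, width=128, num_classes=1):
        super().__init__()
        self.stem = nn.Conv2d(c_in, width, 1)             # channel adjustment
        self.cls_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.reg_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # category prediction
        self.reg_pred = nn.Conv2d(width, 4, 1)            # regression parameters
        self.obj_pred = nn.Conv2d(width, 1, 1)            # confidence

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)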
8. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: in step S4, the loss function L is:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross-entropy loss functions, representing the confidence loss and the category prediction loss respectively;
further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ^2(b, b^gt)/C^2 + ρ^2(w, w^gt)/C_w^2 + ρ^2(h, h^gt)/C_h^2
wherein IOU represents the intersection-over-union ratio of the prediction box and the ground-truth box; b and b^gt represent the center points of the prediction box and the ground-truth box respectively, and w, h and w^gt, h^gt represent their widths and heights; ρ represents the Euclidean distance between two points; C represents the diagonal length of the minimum enclosing box of the prediction box and the ground-truth box; and C_w and C_h represent the width and the height of that minimum enclosing box respectively;
further, the cross-entropy loss function L_BWL used to calculate L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1-y)·log(1-σ(p)))
wherein y is the label, p is the predicted value, and σ represents the sigmoid function.
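A minimal tensor implementation of the EIOU term as reconstructed above; boxes are assumed to be (x1, y1, x2, y2) rows of shape (N, 4), and eps is added only to guard divisions. For L_OBJ and L_CLS, torch.nn.BCEWithLogitsLoss computes the same sigmoid cross entropy as L_BWL in a numerically stable form.

import torch

def eiou_loss(pred, target, eps=1e-7):
    # IOU term: intersection over union of prediction and ground-truth boxes
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # minimum enclosing box: width C_w, height C_h, squared diagonal C^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared Euclidean distance rho^2(b, b^gt) between box centers
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    center_term = (dx ** 2 + dy ** 2) / c2

    # width and height difference terms, normalised by C_w^2 and C_h^2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    w_term = (wp - wt) ** 2 / (cw ** 2 + eps)
    h_term = (hp - ht) ** 2 / (ch ** 2 + eps)

    return 1 - iou + center_term + w_term + h_term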
9. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: in the steps S5 and S6, an Adam optimizer is adopted for optimization when the face detection network Yoloxs-face is trained.
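Claim 9 fixes only the optimizer family; a plausible setup, with the learning rate and weight decay as assumptions rather than patent values, is:

import torch

def build_optimizer(model, lr=1e-3, weight_decay=5e-4):
    # Adam optimizer for both the pre-training (S5) and fine-tuning (S6) stages
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)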
10. A detection system implementing the detection method of any one of claims 1 to 9, characterized in that it comprises:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
the pre-training module is used for training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using a verification set processed by the preprocessing module during training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
CN202210894051.8A 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment Pending CN115240259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210894051.8A CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210894051.8A CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Publications (1)

Publication Number Publication Date
CN115240259A true CN115240259A (en) 2022-10-25

Family

ID=83676552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210894051.8A Pending CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Country Status (1)

Country Link
CN (1) CN115240259A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310785A (en) * 2022-12-23 2023-06-23 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN116310785B (en) * 2022-12-23 2023-11-24 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN116092168A (en) * 2023-03-27 2023-05-09 湖南乐然智能科技有限公司 Face recognition detection method in classroom environment
CN118411277A (en) * 2024-07-02 2024-07-30 潍坊护理职业学院 Wisdom campus attendance data management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination