CN115240259A - Face detection method and face detection system based on YOLO deep network in classroom environment - Google Patents

Face detection method and face detection system based on YOLO deep network in classroom environment

Info

Publication number
CN115240259A
Authority
CN
China
Prior art keywords
module
network
face detection
face
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210894051.8A
Other languages
Chinese (zh)
Inventor
王蓉芳
李智远
朱孟达
慕彩红
郝红侠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210894051.8A priority Critical patent/CN115240259A/en
Publication of CN115240259A publication Critical patent/CN115240259A/en
Pending legal-status Critical Current



Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face detection method and a face detection system based on a YOLO deep network in a classroom environment, improved on the basis of the original YOLOX algorithm. A smaller pooling kernel is used in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. A hybrid attention module is added to the network so that the model learns to suppress useless background information, improving detection precision. An adaptive spatial feature fusion operation is added to the network to resolve the inconsistency problem in the PAFPN structure. An EIOU loss function replaces the IOU loss function, directly minimizing the width and height differences between the real box and the predicted box and accelerating convergence. A transfer-learning pre-training operation addresses the problem of insufficient data and improves the face detection precision of the model in a classroom environment. A dividing module divides the face detection data set acquired in the classroom environment into a training set, a verification set and a test set.

Description

Face detection method and face detection system based on YOLO deep network in classroom environment
Technical Field
The invention belongs to the technical field of deep learning detection, and particularly relates to a face detection method and a face detection system based on a YOLO deep network in a classroom environment.
Background
The classroom is one of the application scenarios of face detection technology. In a traditional teaching environment, a teacher can only take attendance by roll call or sign-in, which is very time-consuming when there are many students. Introducing face detection technology into the classroom enables real-time detection and analysis of classroom attendance, concentration and other conditions of students, helping the teacher understand the attendance, in-class state and learning situation of each student, and then adjust teaching methods and strategies accordingly to improve teaching quality.
In a face detection task in a classroom environment, the following difficulties exist:
1. Faces in the classroom environment are generally small, and the distances between faces are small, making them difficult to distinguish;
2. The postures of students in the classroom environment are unpredictable, with severe occlusions, atypical postures and blurred faces;
3. The background occupies a large proportion of the image in the classroom environment, which interferes with face detection;
4. The scale of faces varies greatly between the front and the back of the classroom, placing high demands on the detection algorithm;
5. Few face detection data sets are available for the classroom environment and producing such data sets is costly, which is insufficient to support the training of large, complex networks.
the prior art includes a face detection method based on traditional manual features and a face detection method based on deep learning.
Before deep learning methods were introduced into the field of face detection, face detection was mainly based on the classical approach: manual features are extracted from an image (or a sliding window over an image), and the features are then fed into a classifier (or a set of classifiers) to detect possible face regions. The performance of these detectors depends to a large extent on the computational efficiency and expressive power of the features. With continuous improvement and exploration by researchers, face detection methods based on traditional manual features have achieved good detection results. However, features designed manually from experience have great limitations and are easily disturbed by environmental factors (such as blur, occlusion and brightness), so the application scenarios of such methods are limited and their robustness in complex environments is poor. Moreover, traditional face detection algorithms cannot automatically extract features useful for the detection task from the original image without human intervention, and performance limitations prevent traditional methods from processing large amounts of data.
With the breakthrough work of deep neural networks on image classification in 2012, face detection has also undergone a major shift. Inspired by the rapid development of deep learning in computer vision, many deep learning based frameworks have been applied to face detection in the past few years, remarkably improving detection accuracy. Because of their high detection efficiency and strong stability, deep learning based face detection models have become the mainstream framework for the face detection task. With continued research into new techniques and new networks, more and more excellent deep learning based face detection networks have been proposed, such as YOLOX, YOLO-Face and YOLO5Face, and these algorithms have obtained state-of-the-art results on various face detection benchmarks. However, past face detection algorithms often pursued detection precision alone while neglecting model size and detection speed; many algorithms trade larger network models and slower detection for gains in precision, making them hard to train, demanding on hardware, computationally expensive and slow, and therefore difficult to apply in practice.
Disclosure of Invention
To overcome the defects of the prior art, the invention aims to provide a face detection method and a face detection system based on a YOLO deep network in a classroom environment. A smaller pooling kernel is used in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. A hybrid attention module is added to the network so that the model learns to suppress useless background information, improving detection precision. An adaptive spatial feature fusion operation is added to the network to resolve the inconsistency problem in the PAFPN structure. An EIOU loss function replaces the IOU loss function, directly minimizing the width and height differences between the real box and the predicted box and accelerating convergence. A transfer-learning pre-training operation addresses the problem of insufficient data and improves the face detection precision of the model in a classroom environment.
To achieve this purpose, the technical solution adopted by the invention is as follows:
a face detection method in a classroom environment based on a YOLO deep network comprises the following steps:
S1, dividing a face detection data set in a classroom environment into a training set, a verification set and a test set;
S2, reading the images in the training set and the verification set divided in step S1, converting the images into RGB format and adjusting their size, and then performing data enhancement on the training set divided in step S1;
S3, constructing a face detection convolutional neural network based on the YOLOX deep network in a classroom environment, named YOLOXs-face;
S4, constructing the loss function of the method using the EIOU loss function and the cross entropy loss function;
S5, training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
S6, continuing to train the YOLOXs-face network on the basis of the pre-training model obtained in step S5 using the training set processed in step S2, verifying with the verification set processed in step S2, and saving the network model that performs best on the verification set;
S7, testing the network model saved in step S6 with the test set divided in step S1 to obtain the face detection result in the classroom environment;
S8, quantitatively evaluating the detection performance of the network model using the F1 coefficient and the average precision of the detection result obtained in step S7.
Specifically, in step S1, samples in a face detection data set in a classroom environment are randomly divided into a training set, a verification set and a test set in a ratio of 11:4:5.
Specifically, step S2 is:
S201, preprocessing the images in the verification set divided in step S1: first converting the images into RGB format, then scaling the images of the verification set and the test set proportionally by bilinear interpolation, and finally unifying the image sizes by adding gray bars to the images;
S202, preprocessing the images in the training set divided in step S1: first converting the images into RGB format, then scaling the images proportionally and randomly changing their aspect ratio, unifying the image sizes by adding gray bars to the images, horizontally flipping the images with a given probability, and finally randomly changing the hue, saturation and brightness of the images to realize data enhancement;
S203, adjusting the real boxes of the verification set preprocessed in step S201 and of the training set preprocessed in step S202, respectively.
Specifically, step S3 is:
S301, constructing a face detection network in a classroom environment based on the YOLO deep network, named YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module;
S302, constructing a CBS module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
S303, constructing a residual module comprising convolution layers, batch normalization layers and SiLU nonlinear activation layers;
S304, constructing a Focus module based on the CBS module of step S302: first slicing the input image, expanding it from three channels to twelve channels, and then performing a convolution operation on the feature layer with one CBS module;
S305, constructing an SPP module based on the CBS module of step S302, the module consisting of CBS modules and maximum pooling operations;
S306, constructing a CSP_N module and a CSP2_N module based on the CBS module of step S302 and the residual module of step S303. The CSP_N module comprises a trunk branch and a residual branch: the trunk branch comprises a CBS module and N residual modules, and the residual branch comprises a CBS module; data are input into the two branches to obtain feature layers of the same size, which are stacked and then passed through a CBS module to obtain the output. The CSP2_N module has the same two-branch structure, except that its N residual modules have the residual edges removed;
S307, constructing the feature extraction module, a CSPDarkNet network, of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the Focus module of step S304, the SPP module of step S305 and the CSP_N and CSP2_N modules of step S306; this structure performs feature extraction on the input data. The data in the training set after the data enhancement of step S2 are input into the CSPDarkNet network, and three effective feature layers are obtained from the intermediate and bottom stages of the CSPDarkNet structure;
S308, constructing the feature enhancement module Attention network of the face detection network YOLOXs-face of step S301, the network consisting of three CBAM attention modules; the three effective feature layers obtained in step S307 are input into the three CBAM attention modules respectively to obtain three hybrid attention feature layers;
S309, constructing the feature enhancement module PAFPN network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302 and the CSP2_N module of step S306, the network consisting of an FPN and a PAN; the three hybrid attention feature layers obtained in step S308 are input into the PAFPN network, features are first transferred and fused in the FPN by up-sampling, and three enhanced feature layers are then obtained in the PAN by down-sampling fusion;
S310, constructing the feature enhancement module ASFF network of the face detection network YOLOXs-face of step S301, the network consisting of three adaptive spatial feature fusion modules; the three enhanced feature layers obtained in step S309 are input into the ASFF network, and different feature layers are adaptively fused to obtain three fused feature layers;
S311, constructing the feature point prediction Yolo Head network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the network consisting of three Yolo Head modules; the three fused feature layers obtained in step S310 are input into the Yolo Head network, and classification and regression operations are performed on the feature layers to obtain three prediction results of different scales;
S312, integrating the prediction results obtained in step S311 to obtain the final face detection result in the classroom environment.
Further, in step S301, the feature extraction module consists of a CSPDarkNet network, the feature enhancement module consists of an Attention network, a PAFPN network and an ASFF network, and the feature point prediction module consists of a Yolo Head network.
Further, the SPP module constructed in step S305 consists of two CBS modules and three maximum pooling operations with pooling kernel sizes of 7 × 7, 5 × 5 and 3 × 3, respectively.
further, in step S307, a Focus module, a CBS module, a CSP _1 module, a CBS module, a CSP _3 module, a CBS module, an SPP module, and a CSP2_1 module are sequentially arranged in the CSPDarkNet network, and outputs of the two CSP _3 modules and the CSP2_1 module serve as effective feature layers.
Further, in step S311, the Yolo Head module first adjusts the number of channels of the input feature layer with a convolution operation and then feeds the adjusted feature layer into a classification branch and a regression branch; the classification branch uses two CBS modules to extract features and then a convolution operation to predict categories, while the regression branch uses two CBS modules to extract features and then two 1 × 1 convolution operations to obtain the confidence and the regression parameters, respectively.
Further, in step S4, the loss function L of the algorithm is:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross entropy loss functions, representing the confidence loss and the class prediction loss, respectively.
Further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ²(b, b^gt)/C² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²
wherein IOU represents the intersection over union of the prediction box and the real box, b and b^gt represent the center points of the prediction box and the real box respectively, ρ represents the Euclidean distance between two points, w, h and w^gt, h^gt are the widths and heights of the prediction box and the real box, C represents the diagonal length of the minimum enclosing box of the prediction box and the real box, and C_w and C_h are the width and height of that minimum enclosing box, respectively.
Further, the cross entropy loss function L_BWL used to compute L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1 - y)·log(1 - σ(p)))
wherein y is the label, p is the predicted value, and σ denotes the sigmoid function.
Further, in steps S5 and S6, when the face detection network YOLOXs-face is trained, an Adam optimizer is used for optimization.
A detection system for the face detection method in a classroom environment based on the YOLOX deep network comprises the following modules:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
a pre-training module: training the YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using the verification set processed by the preprocessing module while training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
The pre-training dataset is the WIDER FACE dataset.
Compared with the prior art, the invention has the following advantages:
1) A spatial pyramid pooling structure with smaller kernels is used. The network structure of the spatial pyramid pooling structure is shown in fig. 6: maximum pooling is performed on the input feature layer with pooling kernels of different sizes, so that spatial feature information at different scales is extracted, which improves the detection accuracy and robustness of the model. Compared with convolutional neural network models containing fully connected layers, which can only process pictures of a fixed input size, the spatial pyramid pooling structure does not restrict the input picture size, making the network more flexible to use. In the invention, the pooling kernel sizes are changed to 7 × 7, 5 × 5 and 3 × 3 in sequence; using smaller pooling kernels in the spatial pyramid pooling structure of the network helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance.
2) A hybrid attention mechanism fusion operation is added. The attention mechanism in computer vision is modeled on the way humans attend. When processing visual information, humans pay different degrees of attention to the received information, focusing on information useful for prediction and automatically ignoring irrelevant content. In computer vision, a mask is generally used to form an attention mechanism: by assigning different weights to each input position, the model attends to important information and ignores irrelevant content. By adding an attention mechanism to the network, the invention enables the model to learn to suppress useless background information and improves detection precision. A flow chart of the attention module used in the invention is shown in fig. 7, with details shown in fig. 8.
3) An adaptive spatial feature fusion operation is added. In YOLOX-s, the PAFPN network performs feature fusion on three effective feature layers; high-level semantic information is then used to detect large targets, and low-level semantic information to detect small targets. In a classroom environment, the scales of faces at the front and back of the classroom usually differ greatly, i.e. large-scale and small-scale faces coexist in the same picture. In this case, conflicts between features at different levels often dominate the PAFPN; this inconsistency interferes with gradient computation during training and reduces the effectiveness of the feature pyramid. The invention resolves the inconsistency problem in the PAFPN structure by adding an adaptive spatial feature fusion module behind the PAFPN structure. Fig. 9 shows the structure of the adaptive spatial feature fusion module, taking adaptive spatial feature fusion module-3 as an example; the adaptive feature fusion is simple to implement and adds very little computation to the model.
4) The loss function is improved using EIOU. Because of the limitations of the IOU loss used by the YOLOX-s network, the invention uses the EIOU loss instead of the IOU loss when training the model. The EIOU loss function consists of three parts: the IOU loss, the center distance loss and the width-height loss; the width-height loss directly minimizes the width and height differences between the real box and the predicted box, which accelerates convergence.
5) A transfer-learning pre-training operation is used. For the face detection task in a classroom environment, too few open data sets are available for model training, and producing data sets is costly. To solve the problem of insufficient data, the invention introduces a transfer-learning pre-training operation: the network is first trained with the WIDER FACE data set to obtain a general face detection model, and the classroom-environment face detection data set is then trained on the basis of this general model to obtain a face detection model for the classroom environment. Compared with randomly initialized network parameters, the transfer-learning pre-training operation accelerates the convergence of the model during training and improves the face detection precision of the model in the classroom environment.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a general structure diagram of the YOLOXs-face according to the present invention.
FIG. 3 is a diagram of the CSPDarkNet network of the present invention.
Fig. 4 is a network structure diagram of PAFPN according to the present invention.
FIG. 5 is a block diagram of the Yolo Head module.
Fig. 6 is a network structure diagram of a spatial pyramid pooling structure.
Fig. 7 is an overall structural view of the hybrid attention module.
FIG. 8 is a detailed block diagram of the channel attention mechanism and the spatial attention mechanism in the hybrid attention module, wherein (a) is the channel attention module and (b) is the spatial attention module.
FIG. 9 shows the structure of the adaptive spatial feature fusion module, taking the adaptive spatial feature fusion module-3 as an example.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
The invention provides a face detection method and a face detection system in a classroom environment based on a YOLO deep network. First, the data set is divided into a training set, a verification set and a test set; data enhancement is then performed on the training set; next, a face detection convolutional neural network in a classroom environment is constructed based on the YOLOX deep network; the network model is trained with a pre-training data set to obtain a pre-training model; the network model is then trained on the basis of the pre-training model using the training set, and the model that performs best on the verification set is saved; finally, the test set is evaluated with the optimal model to obtain the various face detection index results in the classroom environment. A series of improvements are made on the basis of the original YOLOX target detection algorithm, including using a spatial pyramid pooling structure with smaller kernels, adding a hybrid attention module and an adaptive spatial feature fusion module, improving the loss function with EIOU, and using a transfer-learning pre-training operation, which effectively improves the accuracy of face detection in a classroom environment with few computing resources.
Referring to fig. 1, a face detection method in a classroom environment based on the YOLOX deep network includes the following steps:
s1, dividing a face detection data set in a classroom environment into a training set, a verification set and a test set, wherein the method specifically comprises the following steps:
S101, writing Python code to divide the SCUT-HEAD-PartA data set into a training set, a verification set and a test set. The SCUT-HEAD-PartA data set consists of 2000 pictures; 1100 pictures are taken at random as the training set, 400 as the verification set and 500 as the test set.
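By way of illustration, a minimal Python sketch of this 1100/400/500 random split follows; the function name, the list-of-paths interface and the fixed seed are assumptions for reproducibility, not taken from the patent.

```python
import random

def split_scut_head_parta(image_paths, seed=42):
    """Randomly split the 2000 SCUT-HEAD-PartA images into 1100/400/500."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle (seed is an assumption)
    return paths[:1100], paths[1100:1500], paths[1500:2000]
```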
S2, reading the images in the training set and the verification set divided in the step S1, converting the images into an RGB format, adjusting the size of the images, and then performing data enhancement on the training set divided in the step S1, wherein the specific steps are as follows:
S201, preprocessing the images in the verification set divided in step S1: first converting the images into RGB format, then scaling the images of the verification set and the test set proportionally by bilinear interpolation so that the long edge of each image is 640 pixels, and finally creating a gray picture of size 640 × 640 and placing the scaled image at its center;
S202, preprocessing the images in the training set divided in step S1: first converting the images into RGB format, then scaling the images proportionally and randomly changing their aspect ratio, then creating a gray picture of size 640 × 640 and placing the scaled image at its center, horizontally flipping the image with a given probability, and finally randomly changing the hue, saturation and brightness of the image to realize data enhancement;
S203, adjusting the real boxes of the verification set preprocessed in step S201 and of the training set preprocessed in step S202, respectively.
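A minimal sketch of the letterbox preprocessing of steps S201/S202 follows, covering only the deterministic validation/test path; PIL and a mid-gray fill of (128, 128, 128) for the gray bars are assumptions, since the exact fill value is not stated in the text.

```python
from PIL import Image

def letterbox(img, size=640, fill=(128, 128, 128)):
    """Convert to RGB, scale proportionally, and center on a gray canvas."""
    img = img.convert("RGB")
    w, h = img.size
    scale = size / max(w, h)                        # equal-ratio scaling
    nw, nh = int(w * scale), int(h * scale)
    resized = img.resize((nw, nh), Image.BILINEAR)  # bilinear interpolation
    canvas = Image.new("RGB", (size, size), fill)   # 640 x 640 gray picture
    canvas.paste(resized, ((size - nw) // 2, (size - nh) // 2))
    return canvas
```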
S3, with reference to FIG. 2, constructing a face detection convolutional neural network based on the YOLOX deep network in a classroom environment, named YOLOXs-face, specifically as follows:
S301, constructing a face detection network in a classroom environment based on the YOLO deep network, named YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module, wherein the feature extraction module consists of a CSPDarkNet network, the feature enhancement module consists of an Attention network, a PAFPN network and an ASFF network, and the feature point prediction module consists of a Yolo Head network;
S302, constructing a CBS module comprising 1 convolution layer, 1 batch normalization layer and 1 SiLU nonlinear activation layer;
the CBS module consists of 1 convolution layer, 1 batch normalization layer and 1 SiLU nonlinear activation layer; the convolution kernel size of the convolution layer changes with the usage scenario, and the batch normalization layer and the SiLU activation layer follow the convolution layer in sequence.
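A minimal PyTorch sketch of such a CBS block follows (the 'same' padding and bias-free convolution are conventional assumptions); the later sketches in this description reuse it.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> Batch normalization -> SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```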
S303, constructing a residual module comprising 2 convolution layers, 2 batch normalization layers and 2 SiLU nonlinear activation layers;
the residual module consists of 2 convolution layers, 2 batch normalization layers and 2 SiLU nonlinear activation layers; the convolution kernel sizes of the 2 convolution layers are, in sequence, 1 × 1 and 3 × 3, a batch normalization layer and a SiLU activation layer are added behind each convolution layer, and the output is connected with the input through a residual connection as the final output of the module.
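A sketch of this residual block, reusing the CBS class above; keeping the channel count constant across the two convolutions is an assumption.

```python
class Residual(nn.Module):
    """1x1 CBS followed by 3x3 CBS, with the input added back as a residual edge."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = CBS(channels, channels, k=1)
        self.conv2 = CBS(channels, channels, k=3)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```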
S304, constructing a Focus module based on the CBS module of step S302: first slicing the input image, expanding it from three channels to twelve channels, and then performing a convolution operation on the feature layer with one CBS module;
the Focus module takes a value at every other pixel of the picture, expanding the input from three channels to twelve channels, and then adjusts the number of channels of the feature layer with a CBS module whose convolution kernel size is 3 × 3.
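A sketch of the Focus slicing, reusing the CBS class above; the output channel count is an illustrative assumption.

```python
class Focus(nn.Module):
    """Sample every other pixel into four slices (3 -> 12 channels), then 3x3 CBS."""
    def __init__(self, c_in=3, c_out=64):
        super().__init__()
        self.conv = CBS(4 * c_in, c_out, k=3)

    def forward(self, x):
        slices = [x[..., ::2, ::2], x[..., 1::2, ::2],
                  x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(slices, dim=1))
```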
S305, constructing an SPP module based on the CBS module of step S302, the module comprising two CBS modules and three maximum pooling operations with pooling kernel sizes of 7 × 7, 5 × 5 and 3 × 3, respectively;
the SPP module first adjusts the number of channels of the input feature layer with a CBS module whose convolution kernel size is 1 × 1, then applies maximum pooling to the feature layer with three pooling kernels of sizes 7 × 7, 5 × 5 and 3 × 3, stacks the three pooled feature layers with the initial feature layer, and then adjusts the number of channels of the stacked feature layers with a CBS module whose convolution kernel size is 1 × 1 to obtain the final output.
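A sketch of this SPP variant with the smaller 7/5/3 kernels, reusing the CBS class; halving the channel count in the first CBS is a conventional assumption. Stride-1 pooling with padding k // 2 keeps all branches the same size so they can be stacked.

```python
class SPP(nn.Module):
    """1x1 CBS, parallel 7/5/3 max pooling, concat with input, then 1x1 CBS."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.conv1 = CBS(c_in, c_hid, k=1)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (7, 5, 3))
        self.conv2 = CBS(c_hid * 4, c_out, k=1)

    def forward(self, x):
        x = self.conv1(x)
        return self.conv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```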
S306, constructing a CSP_N module and a CSP2_N module based on the CBS module of step S302 and the residual module of step S303. The CSP_N module comprises a trunk branch and a residual branch: the trunk branch comprises a CBS module and N residual modules, and the residual branch comprises one CBS module; data are input into the two branches to obtain two feature layers of the same size, which are stacked and then passed through a CBS module to obtain the output. The CSP2_N module comprises a trunk branch and a residual branch: the trunk branch comprises 1 CBS module and N residual modules with the residual edges removed, and the residual branch comprises one CBS module; data are input into the two branches to obtain two feature layers of the same size, which are stacked and then passed through one CBS module to obtain the output;
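A sketch of the CSP_N branch structure, reusing the CBS and Residual classes above (the CSP2_N variant would swap Residual for a block without the residual edge); the half-width hidden channels are an assumption.

```python
class CSP_N(nn.Module):
    """Trunk (CBS + N residual modules) and shortcut (one CBS), stacked then fused."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hid = c_out // 2
        self.trunk = nn.Sequential(
            CBS(c_in, c_hid, k=1), *[Residual(c_hid) for _ in range(n)])
        self.shortcut = CBS(c_in, c_hid, k=1)
        self.fuse = CBS(2 * c_hid, c_out, k=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.trunk(x), self.shortcut(x)], dim=1))
```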
S307, constructing the feature extraction module, a CSPDarkNet network, of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the Focus module of step S304, the SPP module of step S305 and the CSP_N and CSP2_N modules of step S306; this structure performs feature extraction on the input data. The data in the training set after the data enhancement of step S2 are input into the CSPDarkNet network, and three effective feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512 are obtained from the intermediate and bottom stages of the CSPDarkNet structure;
the CSPDarkNet network comprises, in sequence, a Focus module, a CBS module, a CSP_1 module, a CBS module, a CSP_3 module, a CBS module, a CSP_3 module, a CBS module, an SPP module and a CSP2_1 module; the outputs of the two CSP_3 modules and of the CSP2_1 module serve as the effective feature layers, and the convolution kernel size of the CBS modules is 3 × 3.
S308, constructing the feature enhancement module Attention network of the face detection network YOLOXs-face of step S301, the network consisting of three CBAM attention modules; the three effective feature layers obtained in step S307 are input into the three CBAM attention modules respectively, yielding three hybrid attention feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512;
channel attention module: global maximum pooling and global average pooling are first applied to the input feature layer, the pooled feature layers are processed by a shared fully connected layer, the two results are added, and a Sigmoid activation function then yields a weight for each channel of the input feature layer; finally, the weights are multiplied with the input feature layer to obtain the output;
spatial attention module: the maximum and the average are first taken over the channels at each feature point of the input feature layer, the two results are stacked, a convolution with one output channel adjusts the number of channels, and a Sigmoid activation function then yields a weight for each spatial position of the input feature layer; finally, the weights are multiplied with the input feature layer to obtain the output.
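A sketch of these two attention modules in PyTorch; the reduction ratio of 16 and the 7 × 7 spatial convolution are conventional CBAM choices assumed here, not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Global max/avg pooling -> shared MLP -> sigmoid channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1))

    def forward(self, x):
        w = torch.sigmoid(self.shared(F.adaptive_avg_pool2d(x, 1))
                          + self.shared(F.adaptive_max_pool2d(x, 1)))
        return x * w

class SpatialAttention(nn.Module):
    """Channel-wise mean/max maps -> single-channel conv -> sigmoid position weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```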
S309, constructing the feature enhancement module PAFPN network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302 and the CSP2_N module of step S306, the network consisting of an FPN and a PAN, with the specific structure shown in FIG. 4; the three hybrid attention feature layers obtained in step S308 are input into the PAFPN network, features are first transferred and fused in the FPN by up-sampling, and three enhanced feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512 are then obtained in the PAN by down-sampling fusion;
S310, constructing the feature enhancement module ASFF network of the face detection network YOLOXs-face of step S301, the network consisting of three adaptive spatial feature fusion modules; the three enhanced feature layers obtained in step S309 are input into the ASFF network, and different feature layers are adaptively fused, yielding three fused feature layers of sizes 80 × 80 × 128, 40 × 40 × 256 and 20 × 20 × 512;
the adaptive feature fusion module lets the network learn directly how to spatially filter the features of other layers, so that only useful information is retained for combination. For a given feature layer, the adaptive feature fusion module first resizes the other feature layers to the same size, then learns the optimal fusion: at each spatial position, the different feature layers are fused adaptively, so that features carrying contradictory information are filtered out and discriminative features are strengthened.
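A simplified sketch of per-position adaptive fusion for three feature maps already resized to a common resolution and channel count; generating the softmax weights from a single 1 × 1 convolution is an assumption made for brevity.

```python
class ASFFFuse(nn.Module):
    """Learn per-pixel softmax weights and fuse three same-size feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, f0, f1, f2):
        w = torch.softmax(self.weight(torch.cat([f0, f1, f2], dim=1)), dim=1)
        return f0 * w[:, 0:1] + f1 * w[:, 1:2] + f2 * w[:, 2:3]
```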
S311, constructing the feature point prediction Yolo Head network of the face detection network YOLOXs-face of step S301 based on the CBS module of step S302, the network consisting of three Yolo Head modules, whose structure is shown in FIG. 5; the three fused feature layers obtained in step S310 are input into the Yolo Head network, and classification and regression operations are performed on the feature layers to obtain three prediction results of sizes 80 × 80 × 6, 40 × 40 × 6 and 20 × 20 × 6;
the Yolo Head module first adjusts the number of channels of the input feature layer with a 1 × 1 convolution operation and feeds the adjusted feature layer into a classification branch and a regression branch; the classification branch uses two CBS modules to extract features and then a 1 × 1 convolution operation to predict categories, while the regression branch uses two CBS modules to extract features and then two 1 × 1 convolution operations to obtain the confidence and the regression parameters, respectively;
S312, integrating the prediction results obtained in step S311 to obtain a result of size 8400 × 6, where 8400 (= 80 × 80 + 40 × 40 + 20 × 20) is the number of prediction boxes finally produced by the network and 6 is the size of each prediction, comprising the regression coefficients (x, y, w, h) of the prediction box, the confidence that the prediction box contains an object, and the probability that the object in the prediction box is a face;
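A sketch of this integration step, assuming the three prediction maps come as (batch, 6, H, W) tensors.

```python
import torch

def flatten_predictions(p80, p40, p20):
    """Concatenate (B, 6, 80, 80), (B, 6, 40, 40), (B, 6, 20, 20) into (B, 8400, 6)."""
    flat = [p.flatten(2).permute(0, 2, 1) for p in (p80, p40, p20)]
    return torch.cat(flat, dim=1)  # 6400 + 1600 + 400 = 8400 prediction boxes
```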
S4, constructing the loss function L of the method using the EIOU loss function and the cross entropy loss function, specifically:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross entropy loss functions, representing the confidence loss and the class prediction loss, respectively;
further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ²(b, b^gt)/C² + ρ²(w, w^gt)/C_w² + ρ²(h, h^gt)/C_h²
where IOU represents the intersection over union of the prediction box and the real box, b and b^gt represent the center points of the prediction box and the real box respectively, ρ represents the Euclidean distance between two points, w, h and w^gt, h^gt are the widths and heights of the prediction box and the real box, C represents the diagonal length of the minimum enclosing box of the prediction box and the real box, and C_w and C_h represent the width and height of that minimum enclosing box, respectively;
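A sketch of this EIOU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4); the box format and the small eps for numerical stability are assumptions.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """Per-pair EIOU loss: 1 - IOU + center term + width term + height term."""
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter + eps)
    # minimum enclosing box: width C_w, height C_h, squared diagonal C^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # squared distance between the box centers, rho^2(b, b_gt)
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    return (1 - iou + rho2 / c2
            + (pw - tw) ** 2 / (cw ** 2 + eps)
            + (ph - th) ** 2 / (ch ** 2 + eps))
```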
further, the cross entropy loss function L_BWL used to compute L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1 - y)·log(1 - σ(p)))
wherein y is the label, p is the predicted value, and σ denotes the sigmoid function.
S5, training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
S6, continuing to train the YOLOXs-face network on the basis of the pre-training model obtained in step S5 using the training set processed in step S2, verifying with the verification set processed in step S2, and saving the network model that performs best on the verification set. In steps S5 and S6, the face detection network YOLOXs-face is trained with a batch size of 24, optimized with an Adam optimizer, with an initial learning rate of 0.001 that is multiplied by 0.98 in each training epoch, for 400 training epochs in total.
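A minimal sketch of this training schedule (batch size 24 would be configured in the data loader); `model`, `train_loader` and `compute_loss` are assumed placeholders for the YOLOXs-face network, the enhanced training data and the loss of step S4.

```python
import torch

def train(model, train_loader, compute_loss, epochs=400):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # multiply the learning rate by 0.98 once per epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
    for _ in range(epochs):
        for images, targets in train_loader:
            loss = compute_loss(model(images), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```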
S7, testing the network model saved in the step S6 by using the test set divided in the step S1 to obtain a face detection result in a classroom environment;
S8, quantitatively evaluating the detection performance of the network model using the F1 coefficient and the average precision of the detection result obtained in step S7.
A detection system for the face detection method in a classroom environment based on the YOLOX deep network comprises the following modules:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
a pre-training module: training a YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using a verification set processed by the preprocessing module during training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
The pre-training dataset is the WIDER FACE dataset.
Simulation experiment
1. The experimental conditions are as follows:
table 1 experimental environment configuration of the present invention
2. Simulation content and result analysis:
the samples in the simulation experiment of the invention are from three parts: the first part is a SCUT-HEAD data set, the data set is a face detection data set aiming at a classroom environment and is divided into a part A and a part B, images of the part A come from classroom monitoring videos, and images of the part B come from the Internet, and the face detection data set is used for training a network to obtain a face detection model under the classroom environment and verifying and improving effectiveness; the second part is a WIDER FACE data set which is a popular human FACE detection reference data set and is used for realizing the transfer learning pre-training operation in the invention; the third part is picture data under a real classroom environment which is collected and labeled from a network, and the picture data is used for testing the generalization capability of the model in the invention.
The sizes of the images in the data used by the invention are not consistent, so the image size is unified to 640 × 640 in the data preprocessing stage.
The detection performance of the proposed YOLOXs-face network model is quantitatively evaluated with the F1 coefficient and the Average Precision (AP). The specific meaning of each index is as follows:
TP (True Positive): true positive, representing a correctly classified positive sample;
FN (False Negative): false negative examples, representing misclassified positive samples;
FP (False Positive): false positive case, representing a misclassified negative example;
TN (True Negative): the true negative, representing a negative example that is correctly classified.
In the face detection task, whether each prediction result is correct must first be judged in order to obtain these indexes. Unlike the classification problem, in the face detection task it is necessary to determine whether a detection result is correct by computing the Intersection over Union (IOU) of the prediction box and the real box. The IoU is calculated as:
IoU = area(A ∩ B) / area(A ∪ B)
wherein, A and B respectively represent a prediction frame and a real frame of a human face.
First, a picture is input into the model to obtain prediction boxes. For each prediction box, the IOU values between the prediction box and all real boxes of the picture are computed, and the maximum is taken as MaxIOU. A threshold (generally set to 0.5) is then applied: when MaxIOU is greater than the threshold, the prediction box is classified as a true positive TP; otherwise it is classified as a false positive FP.
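A sketch of this matching rule for boxes given as (x1, y1, x2, y2) tuples; purely illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_boxes, threshold=0.5):
    """MaxIOU rule: TP if the best IoU against any real box exceeds the threshold."""
    return max((iou(pred_box, g) for g in gt_boxes), default=0.0) > threshold
```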
Recall is defined with respect to the original samples and indicates how many positive examples in the sample are predicted correctly. There are two cases: the original positive class is predicted as positive (TP), or the original positive class is predicted as negative (FN), where TP + FN equals the number of real boxes:
Recall = TP / (TP + FN)
precision (Precision) is for the prediction result, indicating how many of the samples predicted to be positive are true positive samples. Then two possibilities are possible to predict positive class as positive class (TP) and negative class as positive class (FP):
Figure BDA0003768683840000221
the F1 score is an index for measuring the accuracy of the two classification models, and takes into account the accuracy and recall of the classification models.
F1 = 2 × Precision × Recall / (Precision + Recall)
Average Precision (AP) is a performance metric for algorithms that predict both target location and class:
AP = ∫₀¹ p(r) dr
wherein p represents Precision, r represents Recall, and p is a function with r as a parameter.
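A sketch of the recall, precision and F1 computations from TP/FP/FN counts (AP, being the area under the precision-recall curve p(r), is usually approximated by numerical integration over ranked detections and is omitted here).

```python
def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```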
The present invention uses ablation experiments to verify the effectiveness of the improvement.
TABLE 2 summary of ablation test results obtained from simulation experiments of the present invention
As the results in Table 2 show, each of the improvements of the invention over the YOLOX deep network is effective. EIOU denotes replacing the IOU loss function in YOLOX-s with the EIOU loss function, which improves the detection precision of the model by 1%; ASFF denotes adding the adaptive spatial feature fusion module to the network, which improves the detection precision by 0.1%; Attention denotes adding the CBAM attention mechanism module to the network, which improves the detection precision by 0.05%; SPP (3, 5, 7) denotes using smaller pooling kernels in the spatial pyramid pooling structure of the backbone network, which improves the detection performance by 0.07%; finally, Pretrained denotes the transfer-learning pre-training operation, which improves the detection performance of the network by 0.83%.
The invention compares the face detection results of different networks, where YOLO-face is a face detection network based on YOLOv3 and Tinaface is one of the face detectors with the most advanced detection performance at present.
TABLE 3 summary of model comparison results obtained from simulation experiments of the present invention
The results in Table 3 show that, compared with the other algorithms, the proposed YOLOXs-face method is well balanced and better suited to the real-time face detection task in a classroom environment. Compared with the YOLO-face algorithm, the proposed YOLOXs-face method achieves better face detection performance with fewer parameters, less computation and faster detection. Compared with the YOLOX-s algorithm, the proposed YOLOXs-face method greatly improves detection precision at the cost of only a small increase in model parameters, computation and detection time. Compared with the Tinaface algorithm, although the detection precision of the proposed YOLOXs-face method is lower, its model parameter count, computation and per-picture detection time are far smaller than those of Tinaface, so YOLOXs-face places lower demands on hardware equipment, which favors the popularization of the method; at the same time, the F1 score of YOLOXs-face is higher than that of Tinaface, so, considered comprehensively, YOLOXs-face performs better.
In summary, the invention provides a face detection method and a face detection system in a classroom environment based on a YOLO deep network, which make a series of improvements to the original YOLOX algorithm, including using a spatial pyramid pooling structure with smaller kernels, adding a hybrid attention module and an adaptive spatial feature fusion module, improving the loss function with EIOU, and using a transfer-learning pre-training operation, thereby improving the precision of face detection in the classroom environment with few computing resources.
The method has a better detection effect on faces in the classroom environment. First, it uses smaller pooling kernels in the spatial pyramid pooling structure of the network, which helps the model detect small-scale faces in the classroom environment more easily and improves the overall face detection performance. Second, the hybrid attention fusion operation and the adaptive spatial feature fusion operation reduce the influence of the environment on face detection and the interference between faces of different scales, reducing the probability of false detection.
The invention has low hardware requirements and good universality. Compared with the prior art, the model of the proposed method is smaller and runs well on devices with little memory.
The invention has low computational cost and short detection time. Compared with the prior art, the network requires less computation and runs better on lower-performance devices.
The model in the invention adopts a modular design: modules are added or modified to address the shortcomings of the base network in the face detection task in a classroom environment, and as new techniques develop and better network modules are proposed, the invention can be iteratively updated at any time to improve model performance.

Claims (10)

1. A face detection method in a classroom environment based on a YOLO deep network is characterized in that: the method specifically comprises the following steps:
the method includes the steps that S1, a face detection data set in a classroom environment is divided into a training set, a verification set and a test set;
s2, reading the images in the training set and the verification set divided in the step S1, converting the images into an RGB format, adjusting the size of the images, and then performing data enhancement on the training set divided in the step S1;
s3, constructing a face detection convolutional neural network based on a YOLOX deep network in a classroom environment, and naming the face detection convolutional neural network as a YOLOXs-face;
s4, constructing a loss function by using the EIOU loss function and the cross entropy loss function;
s5, training the YOLOXs-face network by using a pre-training data set to obtain a pre-training model;
s6, continuously training the YOLOXs-face network on the basis of the pre-training model obtained in the step S5 by using the training set processed in the step S2, verifying by using the verification set processed in the step S2, and storing the optimal network model represented on the verification set;
s7, testing the network model saved in the step S6 by using the test set divided in the step S1 to obtain a face detection result in a classroom environment;
and S8, quantitatively evaluating the detection performance of the network model by using the F1 score and the average precision of the detection result obtained in the step S7.
2. The face detection method in the classroom environment based on the YOLO deep network as claimed in claim 1, wherein: in the step S1, samples in the face detection data set in the classroom environment are randomly divided into a training set, a verification set and a test set according to the ratio of 11.
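As an illustration of the S1 split in claim 2, the three-way random division can be written with torch.utils.data.random_split. The split ratio printed above appears truncated in this publication text, so the 8:1:1 proportions below are purely an assumption for the sketch, not the claimed ratio.

import torch
from torch.utils.data import random_split

# Minimal sketch of the S1 split; the 8:1:1 ratio is an assumption,
# since the ratio digits are garbled in this text.
def split_dataset(dataset, ratios=(0.8, 0.1, 0.1), seed=0):
    n = len(dataset)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    n_test = n - n_train - n_val                     # remainder goes to the test set
    generator = torch.Generator().manual_seed(seed)  # reproducible shuffle
    return random_split(dataset, [n_train, n_val, n_test], generator=generator)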
3. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: the specific method of the step S2 is as follows:
s201, preprocessing the images in the verification set divided in the step S1, firstly converting the images into an RGB format, then scaling the sizes of the images in the verification set and the test set in an equal ratio by using a bilinear interpolation method, and finally unifying the sizes of the images by adding gray bars to the images;
s202, preprocessing the images in the training set divided in the step S1, firstly converting the images into an RGB format, then scaling the images in an equal ratio, and then scaling the aspect ratio of the images randomly; unifying the size of the image by a method of adding gray bars to the image, horizontally turning the image according to the probability, and finally randomly changing the tone, saturation and brightness of the image to realize data enhancement;
and S203, adjusting the ground-truth boxes of the verification set preprocessed in the step S201 and the training set preprocessed in the step S202 accordingly.
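A minimal sketch of the S201/S202 resize path in claim 3: equal-ratio bilinear scaling followed by gray-bar (letterbox) padding to a fixed square input. The 640x640 target size and the gray value 128 are illustrative assumptions, not values stated in the patent; the returned scale and offsets are what S203 would use to adjust the ground-truth boxes.

from PIL import Image

# Letterbox sketch: equal-ratio bilinear scaling plus gray padding bars.
# Target size and gray value are assumptions.
def letterbox(image, target=(640, 640), gray=(128, 128, 128)):
    image = image.convert("RGB")                 # ensure RGB format (S201)
    w, h = image.size
    tw, th = target
    scale = min(tw / w, th / h)                  # equal-ratio scaling factor
    nw, nh = int(w * scale), int(h * scale)
    resized = image.resize((nw, nh), Image.BILINEAR)
    canvas = Image.new("RGB", target, gray)      # gray background
    dx, dy = (tw - nw) // 2, (th - nh) // 2
    canvas.paste(resized, (dx, dy))              # center the scaled image
    return canvas, scale, (dx, dy)               # scale/offsets feed S203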
4. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: the specific method of the step S3 comprises the following steps:
s301, constructing a face detection network based on a YOLO deep network in a classroom environment, and naming the face detection network as YOLOXs-face; the YOLOXs-face network comprises a feature extraction module, a feature enhancement module and a feature point prediction module;
s302, constructing a CBS module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
s303, constructing a residual error module comprising a convolution layer, a batch normalization layer and a SiLU nonlinear activation layer;
s304, constructing a Focus module based on the CBS module in the step S302, firstly, slicing the input image, expanding the input image from three channels to twelve channels, and then performing convolution operation on the characteristic layer by using one CBS module;
s305, constructing an SPP module based on the CBS module of the step S302, wherein the module consists of the CBS module and a maximum pooling operation;
s306, constructing a CSP _ N module and a CSP2_ N module based on the CBS module in the step S302 and the residual error module in the step S303, wherein the CSP _ N module comprises a trunk branch and a residual error branch, the trunk branch comprises a CBS module and N residual error modules, the residual error branch comprises a CBS module, data are respectively input into the trunk branch and the residual error branch to obtain characteristic layers with the same size, and the characteristic layers are stacked and then output through the CBS module; the CSP2_ N module comprises a trunk branch and a residual branch, the trunk branch comprises a CBS module and N residual modules for removing residual edges, the residual branch comprises a CBS module, data are respectively input into the trunk branch and the residual branch to obtain characteristic layers with the same size, and the characteristic layers are stacked and then output through the CBS module;
s307, constructing a feature extraction module CSPDarkNet network of the face detection network YOLOXs-face in the step S301 based on the CBS module in the step S302, the Focus module in the step S304, the SPP module in the step S305, the CSP _ N module in the step S306 and the CSP2_ N module, wherein the feature extraction operation is carried out on input data by the structure; inputting the data in the training set subjected to data enhancement in the step S2 into a CSPDarkNet network, and obtaining three effective characteristic layers in the middle layer, the middle-lower layer and the bottom layer of the CSPDarkNet structure;
s308, constructing a feature enhancement module Attention network of the face detection network YOLOXs-face in the step S301, wherein the network consists of three CBAM Attention modules; inputting the three effective feature layers obtained in the step S307 into three CBAM attention modules respectively to obtain three mixed attention feature layers;
s309, constructing a feature enhancement module PAFPN network of the face detection network YOLOXs-face in the step S301 based on the CBS module in the step S302 and the CSP2_ N module in the step S306, wherein the network consists of the FPN and the PAN; inputting the three mixed attention feature layers obtained in the step S308 into a PAFPN network, firstly performing feature transfer fusion in the FPN network in an upsampling mode, and then obtaining three enhanced feature layers in a FAN network in a downsampling fusion mode;
s310, constructing a feature enhancement module (ASFF) network of the face detection network YOLOXs-face in the step S301, wherein the network consists of three self-adaptive spatial feature fusion modules; inputting the three enhanced feature layers obtained in the step S309 into an ASFF network, and adaptively fusing different feature layers to obtain three fused feature layers;
s311, constructing a feature point prediction Yolo Head network of the face detection network Yoloxs-face in the step S301 based on the CBS module in the step S302, wherein the network consists of Yolo Head modules; inputting the three fusion feature layers obtained in the step S310 into a Yolo Head network, and performing classification and regression operation on the feature layers to obtain three prediction results with different scales;
and S312, integrating the prediction results obtained in the step S311 to obtain a final face detection result in the classroom environment.
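As an illustration of steps S302 and S304, a CBS block (convolution, batch normalization, SiLU) and the Focus slicing that expands three input channels to twelve before a single CBS convolution might look as follows in PyTorch; the kernel sizes and output channel count are assumptions, not values fixed by the claim.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + BatchNorm + SiLU activation (S302).
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    # Slice the image into four interleaved sub-images (3 -> 12 channels),
    # then apply one CBS convolution (S304).
    def __init__(self, c_out=32):
        super().__init__()
        self.cbs = CBS(12, c_out)

    def forward(self, x):
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.cbs(torch.cat(patches, dim=1))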
5. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: in the step S301, the feature extraction module is composed of the CSPDarkNet network, the feature enhancement module is composed of the Attention network, the PAFPN network and the ASFF network, and the feature point prediction module is composed of the Yolo Head network; in the step S307, the CSPDarkNet network sequentially comprises a Focus module, a CBS module, a CSP_1 module, a CBS module, a CSP_3 module, a CBS module, a CSP_3 module, a CBS module, an SPP module and a CSP2_1 module, and the outputs of the two CSP_3 modules and the CSP2_1 module serve as the effective feature layers.
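The residual module of S303 and the two-branch CSP_N module of S306 referenced in claim 5 could be sketched as below; channel widths are assumptions, and CBS is redefined so the sketch stays self-contained. The CSP2_N variant would differ only in dropping the identity shortcut inside the trunk's residual blocks.

import torch
import torch.nn as nn

class CBS(nn.Module):
    # Convolution + BatchNorm + SiLU, redefined here for self-containment.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Residual(nn.Module):
    # Residual module (S303): two CBS layers with an identity shortcut.
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(CBS(c, c, k=1), CBS(c, c, k=3))

    def forward(self, x):
        return x + self.block(x)

class CSP_N(nn.Module):
    # CSP_N (S306): trunk branch (CBS + N residual modules) and a shortcut
    # branch (one CBS); outputs are stacked along the channel dimension
    # and fused by a final CBS.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_mid = c_out // 2
        self.trunk = nn.Sequential(CBS(c_in, c_mid),
                                   *[Residual(c_mid) for _ in range(n)])
        self.shortcut = CBS(c_in, c_mid)
        self.fuse = CBS(2 * c_mid, c_out)

    def forward(self, x):
        return self.fuse(torch.cat([self.trunk(x), self.shortcut(x)], dim=1))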
6. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: the SPP module constructed in the step S305 consists of two CBS modules and three maximum pooling operations with pooling kernel sizes of 7×7, 5×5 and 3×3, respectively.
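A sketch of the claim-6 SPP module: one CBS in, three parallel stride-1 max poolings with 3×3, 5×5 and 7×7 kernels (the smaller kernels the description credits for small-face detection), channel stacking, and one CBS out. Channel counts are assumptions.

import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, c_in, c_out, kernels=(3, 5, 7)):
        super().__init__()
        c_mid = c_in // 2
        self.cbs_in = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        # stride-1 pooling with same padding keeps the spatial size unchanged
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels])
        self.cbs_out = nn.Sequential(
            nn.Conv2d(c_mid * (len(kernels) + 1), c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        x = self.cbs_in(x)
        return self.cbs_out(torch.cat([x] + [p(x) for p in self.pools], dim=1))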
7. The face detection method in the classroom environment based on the YOLO deep network of claim 4, wherein: in the step S311, the Yolo Head module first adjusts the number of channels of the input feature layer by a convolution operation, and then inputs the adjusted feature layer into the classification branch and the regression branch respectively; the classification branch first extracts features with two CBS modules and then predicts categories with a convolution operation, and the regression branch first extracts features with two CBS modules and then obtains the confidence and the regression parameters with two separate 1×1 convolution operations.
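The decoupled head of claim 7 could be sketched as follows: a 1×1 convolution adjusts channels, then a classification branch (two CBS plus one class-prediction convolution) and a regression branch (two CBS, then two separate 1×1 convolutions for the regression parameters and the confidence). The 128-channel width and the single face class are assumptions.

import torch
import torch.nn as nn

def cbs(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.SiLU())

class YoloHead(nn.Module):
    def __init__(self, c_in, width=128, num_classes=1):
        super().__init__()
        self.stem = nn.Conv2d(c_in, width, 1)             # channel adjustment
        self.cls_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.reg_branch = nn.Sequential(cbs(width, width), cbs(width, width))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # category prediction
        self.reg_pred = nn.Conv2d(width, 4, 1)            # regression parameters
        self.obj_pred = nn.Conv2d(width, 1, 1)            # confidence

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)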
8. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: in step S4, the loss function L is:
L = 5·L_EIOU + L_OBJ + L_CLS
wherein L_EIOU is the EIOU loss function, representing the loss of the prediction box, and L_OBJ and L_CLS are cross-entropy loss functions, representing the confidence loss and the category prediction loss respectively;
further, the EIOU loss function L_EIOU is:
L_EIOU = 1 - IOU + ρ^2(b, b^gt)/C^2 + ρ^2(w, w^gt)/C_w^2 + ρ^2(h, h^gt)/C_h^2
wherein IOU represents the intersection-over-union ratio of the prediction box and the ground-truth box; b and b^gt represent the center points of the prediction box and the ground-truth box respectively, and w, h and w^gt, h^gt represent their widths and heights; ρ represents the Euclidean distance between two points; C represents the diagonal length of the minimum enclosing box of the prediction box and the ground-truth box; and C_w and C_h represent the width and the height of that minimum enclosing box respectively;
further, the cross-entropy loss function L_BWL used to calculate L_OBJ and L_CLS is:
L_BWL = -(y·log σ(p) + (1-y)·log(1-σ(p)))
wherein y is the label, p is the predicted value, and σ represents the sigmoid function.
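A minimal tensor implementation of the EIOU term as reconstructed above; boxes are assumed to be (x1, y1, x2, y2) rows of shape (N, 4), and eps is added only to guard divisions. For L_OBJ and L_CLS, torch.nn.BCEWithLogitsLoss computes the same sigmoid cross entropy as L_BWL in a numerically stable form.

import torch

def eiou_loss(pred, target, eps=1e-7):
    # IOU term: intersection over union of prediction and ground-truth boxes
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # minimum enclosing box: width C_w, height C_h, squared diagonal C^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # squared Euclidean distance rho^2(b, b^gt) between box centers
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    center_term = (dx ** 2 + dy ** 2) / c2

    # width and height difference terms, normalised by C_w^2 and C_h^2
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    w_term = (wp - wt) ** 2 / (cw ** 2 + eps)
    h_term = (hp - ht) ** 2 / (ch ** 2 + eps)

    return 1 - iou + center_term + w_term + h_term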
9. The face detection method in the classroom environment based on the YOLO deep network of claim 1, wherein: in the steps S5 and S6, an Adam optimizer is adopted for optimization when the face detection network Yoloxs-face is trained.
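Claim 9 fixes only the optimizer family; a plausible setup, with the learning rate and weight decay as assumptions rather than patent values, is:

import torch

def build_optimizer(model, lr=1e-3, weight_decay=5e-4):
    # Adam optimizer for both the pre-training (S5) and fine-tuning (S6) stages
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)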
10. A detection system implementing the detection method of any one of claims 1 to 9, characterized in that it comprises:
the dividing module is used for dividing a face detection data set acquired in a classroom environment into a training set, a verification set and a test set;
the preprocessing module is used for adjusting the sizes of the images of the verification set and the training set and then enhancing the data of the training set;
the network module is used for constructing a YOLOXs-face network based on a YOLOX deep network;
the pre-training module is used for training the YOLOXs-face network with a pre-training data set to obtain a pre-training model;
the training module is used for training the YOLOXs-face network by using a training set processed by the preprocessing module on the basis of the pre-training model;
the verification module is used for verifying by using a verification set processed by the preprocessing module during training and storing the optimal network model represented on the verification set;
and the detection module is used for testing the saved optimal network model by using the divided test set to obtain a face detection result in the classroom environment.
CN202210894051.8A 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment Pending CN115240259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210894051.8A CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210894051.8A CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Publications (1)

Publication Number Publication Date
CN115240259A true CN115240259A (en) 2022-10-25

Family

ID=83676552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210894051.8A Pending CN115240259A (en) 2022-07-27 2022-07-27 Face detection method and face detection system based on YOLO deep network in classroom environment

Country Status (1)

Country Link
CN (1) CN115240259A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310785A (en) * 2022-12-23 2023-06-23 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN116310785B (en) * 2022-12-23 2023-11-24 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN116092168A (en) * 2023-03-27 2023-05-09 湖南乐然智能科技有限公司 Face recognition detection method in classroom environment
CN118411277A (en) * 2024-07-02 2024-07-30 潍坊护理职业学院 Wisdom campus attendance data management system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination