CN111368707B - Face detection method, system, device and medium based on feature pyramid and dense block - Google Patents


Info

Publication number: CN111368707B
Application number: CN202010134064.6A
Authority: CN (China)
Prior art keywords: dense block, face detection, face, image, dense
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111368707A
Inventors: 曾凡智, 邹磊, 周燕
Current assignee: Foshan University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Foshan University
Application filed by Foshan University; priority to CN202010134064.6A; publication of application CN111368707A; application granted; publication of grant CN111368707B


Classifications

    • G06V40/161: Human faces; detection, localisation, normalisation
    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V40/168: Human faces; feature extraction, face representation
    • G06V40/172: Human faces; classification, e.g. identification
    • Y02D10/00: Energy efficient computing


Abstract

The invention discloses a face detection method, system, device and medium based on a feature pyramid and dense blocks. The method comprises the following steps: constructing a face detection network from dense blocks, pooling layers and feature-pyramid-based feature fusion modules; acquiring a face image training set; training the face detection network with the face image training set; and performing face detection on an image to be detected with the trained face detection network to obtain a face detection result. By using dense blocks, which have few parameters and heavily reuse features, together with the top-down feature fusion of a feature pyramid, the invention can quickly and efficiently detect faces of different scales in an image.

Description

Face detection method, system, device and medium based on feature pyramid and dense block
Technical Field
The invention relates to a face detection method, system, device and medium based on a feature pyramid and dense blocks, and belongs to the fields of deep learning and image processing.
Background
Face detection technology is a key link in automatic face recognition systems: given any image, a certain strategy is used to search it and determine whether it contains a face; if it does, the position, size and pose of the face are returned.
Early face detection algorithms were mostly based on hand-crafted features (e.g., image texture); a representative example is the Viola-Jones detector, which combines cascaded features with AdaBoost learning. Many scholars subsequently proposed face detection algorithms that run in real time, introducing new local features, new acceleration algorithms and new cascade architectures. Besides cascade-based methods, some researchers added a Deformable Part Model (DPM) to face detection algorithms and achieved better detection results.
In recent years, with the continuous progress of deep learning, face detection has developed further, and more and more face detection algorithms incorporate deep learning. Farfade et al. fine-tuned a convolutional neural network trained on the 1000 classes of ImageNet and used it for classifying faces and non-faces. Faceness trains a series of small networks and cascades them to detect partially occluded faces. CascadeCNN builds a cascade structure on top of convolutional neural networks and achieves very good results. UnitBox introduced a new intersection-over-union (IoU) loss function, markedly improving the separation of faces from non-faces.
However, most of the existing face detection algorithms have the following problems:
1) Detection takes too long. Most existing face detection algorithms are run in combination with the image pyramid technique in order to detect faces of different scales in an image. The image pyramid scales an image into several images of different sizes so that objects (here, faces) of different sizes can be detected; with only a single-scale image, many smaller faces may be missed. However, combining face detection with an image pyramid is time-consuming to run: a single image may need to be processed ten or more times to cover the different face scales.
2) The network has many parameters and the model is large. Many current face detection networks are designed to be very deep (with many layers). The advantage is that more features can be extracted and faces can be better distinguished from non-faces; the drawbacks are a large parameter count and long computation time.
Disclosure of Invention
In view of this, the present invention provides a face detection method, system, device and medium based on a feature pyramid and dense blocks, which can quickly and efficiently detect faces of different scales in an image by using dense blocks, with their low parameter count and heavy feature reuse, together with the top-down feature fusion of a feature pyramid.
The first object of the present invention is to provide a face detection method based on a feature pyramid and dense blocks.
The second object of the present invention is to provide a face detection system based on a feature pyramid and dense blocks.
The third object of the invention is to provide a computer device.
The fourth object of the invention is to provide a storage medium.
The first purpose of the invention can be achieved by adopting the following technical scheme:
a face detection method based on a feature pyramid and a dense block, the method comprising:
constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
acquiring a face image training set;
training a face detection network by using a face image training set;
and performing face detection on the image to be detected by using the trained face detection network to obtain a face detection result.
Further, training the face detection network by using the face image training set specifically includes:
dividing a face image training set into a plurality of batches of face images;
and setting a plurality of periods, and sequentially inputting the face images of each batch into the face detection network for training in each period so as to finish the training of the face detection network in the plurality of periods.
Furthermore, in the face detection network, eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module;
the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block and the eighth dense block are connected in sequence;
the input of the first feature fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block;
and the input of the second characteristic fusion module is respectively connected with the output of the second dense block and the output of the first characteristic fusion module.
Further, performing face detection on the image to be detected using the trained face detection network to obtain a face detection result specifically includes:
inputting the image to be detected into the trained face detection network, processing it in sequence through the first dense block and the second dense block, and outputting a first image feature;
inputting the first image feature into the first pooling layer for down-sampling, processing the down-sampled result in sequence through the third, fourth and fifth dense blocks, and outputting a second image feature;
inputting the second image feature into the second pooling layer for down-sampling, processing the down-sampled result in sequence through the sixth, seventh and eighth dense blocks, and outputting a third image feature;
performing feature fusion on the second image feature and the third image feature using the first fusion module, and outputting a fourth image feature;
and performing feature fusion on the first image feature and the fourth image feature using the second fusion module, and outputting the face detection result.
Furthermore, each dense block comprises a plurality of elements of the same type; the elements are connected in sequence and joined into a whole through skip connections;
each element comprises, connected in sequence, a 1 × 1 convolution layer, a first Swish function layer, a first batch normalization layer, a 3 × 3 convolution layer, a second Swish function layer and a second batch normalization layer, wherein the 1 × 1 convolution layer compresses and reduces the dimension of the previous layer's output.
Further, the feature pyramid is preprocessed before feature fusion, and the preprocessing includes 1 × 1 convolution and 2 × upsampling.
Further, the loss function for training the face detection network is:

L(p, u, t^u, v) = α·L_cls(p, u) + L_loc(t^u, v)

wherein L(p, u, t^u, v) represents the overall loss value of the face detection network; L_cls represents the classification loss value, L_cls(p, u) = −log(p_u); L_loc represents the regression-box position loss value,

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i^u − v_i), where smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise;

p = (p_0, …, p_k) represents the probability value the face detection network predicts for each category of the detected object, and k indicates that the classified objects are face and non-face; u represents a face; t^u = (t_x^u, t_y^u, t_w^u, t_h^u) represents the position at which the face detection network detected a face, where t_x^u and t_y^u are the x and y values of the upper-left corner of the face frame containing the detected object, and t_w^u and t_h^u are the width and height of the regression frame; v = (v_x, v_y, v_w, v_h) represents the position of the manually labelled face, where v_x and v_y are the x and y values of the upper-left corner of the labelled face frame, and v_w and v_h are its width and height; α is a weight.
The second purpose of the invention can be achieved by adopting the following technical scheme:
a feature pyramid and dense block based face detection system, the system comprising:
the construction unit is used for constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
the acquisition unit is used for acquiring a face image training set;
the training unit is used for training the face detection network by using a face image training set;
and the detection unit is used for carrying out face detection on the image to be detected by utilizing the trained face detection network to obtain a face detection result.
The third purpose of the invention can be achieved by adopting the following technical scheme:
A computer device comprising a processor and a memory for storing a processor-executable program, wherein the processor, when executing the program stored in the memory, implements the face detection method described above.
The fourth purpose of the invention can be achieved by adopting the following technical scheme:
a storage medium stores a program that, when executed by a processor, implements the above-described super-resolution image reconstruction method.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention has few parameters and stable training. It uses dense blocks as the main module of the face detection network and exploits their heavy feature reuse, achieving high-quality feature extraction with few network layers while avoiding the vanishing-gradient phenomenon during training. In addition, face detection time is short: the image pyramid technique used in traditional face detection algorithms is replaced by the feature pyramid technique, so the algorithm detects faster without loss of detection quality and can meet the needs of application scenarios that do not require detecting very small faces.
2. The trained face detection network is small, making it a lighter and faster face detection algorithm. It can be ported to many small embedded devices or devices with limited memory, and it can also be built into algorithms such as face recognition, facial expression recognition and age recognition, acquiring faces from images more quickly without affecting those methods' running time, giving it broad application value.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.
Fig. 1 is a flowchart of a face detection method based on a feature pyramid and a dense block in embodiment 1 of the present invention.
Fig. 2 is a general architecture diagram of a face detection network according to embodiment 1 of the present invention.
Fig. 3 is a structural view of a dense block composition in example 1 of the present invention.
Fig. 4 is a structural diagram of a feature pyramid in embodiment 1 of the present invention.
Fig. 5 is a flowchart of training a face detection network by using a face image training set according to embodiment 1 of the present invention.
Fig. 6 is a flowchart of performing face detection on an image to be detected by using a trained face detection network in embodiment 1 of the present invention.
Fig. 7 is a block diagram of a face detection system based on a feature pyramid and a dense block according to embodiment 2 of the present invention.
Fig. 8 is a block diagram of a training unit according to embodiment 2 of the present invention.
Fig. 9 is a block diagram of a computer device according to embodiment 3 of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.
Example 1:
as shown in fig. 1, the present embodiment provides a face detection method based on a feature pyramid and a dense block, which includes the following steps:
s101, constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid.
The general architecture of the face detection network is shown in fig. 2, wherein eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module.
As can be seen from fig. 2, the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block, and the eighth dense block are sequentially connected; the input of the first characteristic fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block; the input of the second feature fusion module is connected with the output of the second dense block and the output of the first feature fusion module respectively.
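The connectivity above implies a simple size bookkeeping: dense blocks preserve spatial size, and only the two pooling layers halve it. A minimal sketch (the 640 × 640 input size is an assumption for illustration; the patent does not fix an input resolution):

```python
# Sketch of how the spatial size of the feature maps evolves through the
# network of fig. 2: dense blocks keep the size, each pooling layer halves it.

def feature_map_sizes(input_size):
    first = input_size    # first image feature: after dense blocks 1-2
    second = first // 2   # second image feature: after pooling 1 + dense blocks 3-5
    third = second // 2   # third image feature: after pooling 2 + dense blocks 6-8
    return first, second, third

print(feature_map_sizes(640))  # -> (640, 320, 160)
```

The first fusion module then combines the two smallest maps, and the second fusion module combines the result with the largest map, which is why the upsampling step described later is needed.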
Further, each dense block has the structure shown in fig. 3 and comprises a plurality of elements of the same type connected in sequence. Each element comprises, connected in sequence, a 1 × 1 convolution layer, a first Swish function layer, a first batch normalization layer, a 3 × 3 convolution layer, a second Swish function layer and a second batch normalization layer. The 1 × 1 convolution layer compresses and reduces the dimension of the previous layer's output to cut the amount of computation. The Swish activation function replaces the ReLU function of traditional networks, better strengthening the model's nonlinear expressive power and its ability to cope with complex environments. The batch normalization layer (BN) normalizes each layer's input so that its distribution stays stable, which speeds up training. The 3 × 3 convolution layers extract image features locally and keep the parameter count low while ensuring good performance. Finally, all elements in a dense block are joined into a whole through skip connections, so the features produced by every element are used to the greatest extent and gradient dispersion during training of the face detection network is avoided.
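To make the "low parameter count" claim concrete, the parameters of one such element can be counted directly (the channel widths 256, 64 and 32 are assumptions for illustration; the patent does not specify them):

```python
# Illustrative parameter count for one dense-block element: a 1x1
# "compression" convolution followed by a 3x3 convolution, with a
# batch-normalization layer (2 parameters per channel) after each.

def element_params(in_ch, bottleneck_ch, out_ch):
    conv1x1 = in_ch * bottleneck_ch           # 1x1 conv weights
    bn1 = 2 * bottleneck_ch                   # BN scale + shift per channel
    conv3x3 = bottleneck_ch * out_ch * 3 * 3  # 3x3 conv weights
    bn2 = 2 * out_ch
    return conv1x1 + bn1 + conv3x3 + bn2

# The 1x1 compression keeps the 3x3 conv cheap: compare with a direct
# 3x3 conv on all 256 input channels, 256 * 32 * 9 = 73728 weights.
print(element_params(256, 64, 32))  # -> 35008
```

This is why the 1 × 1 bottleneck matters: as skip connections grow the input channel count, the 3 × 3 convolution still only sees the compressed representation.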
Because the Rectified Linear Unit (ReLU) is simple to compute, efficient and fast to respond, it is used in many deep learning algorithms. Its advantage, however, lies only in forward propagation: because the ReLU function discards all negative values, it can easily drive the model's outputs to all zeros, after which training can no longer proceed.
Based on the above, this embodiment uses the Swish function as the activation function; its mathematical form is shown in formula (1). Compared with the ReLU function, Swish pushes the mean output of an activation unit toward 0, achieving an effect similar to batch normalization and reducing computation: an output mean close to 0 reduces the shift effect and keeps gradients close to their natural state.

f(x) = x·σ(x) = x / (1 + e^(−x))    (1)
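Since the rendered formula (1) appears only as an image placeholder in this text, here is the commonly used form of Swish, f(x) = x·sigmoid(x), as a runnable sketch (the β = 1 variant is an assumption):

```python
import math

# Swish activation: f(x) = x * sigmoid(x) = x / (1 + e^(-x)).
def swish(x):
    return x / (1.0 + math.exp(-x))

# Unlike ReLU, Swish keeps a small non-zero response for negative inputs,
# which helps avoid the all-zero-output failure mode described above.
print(swish(-1.0))
```

For large positive inputs Swish behaves like the identity, and for large negative inputs it decays toward zero without being exactly zero, which is what keeps gradients flowing.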
Further, the face detection network contains pooling operations (i.e., downsampling): as the input image is convolved, pooling reduces its size in stages, which matches the scaling of an image pyramid. This embodiment therefore uses the feature pyramid technique to take the feature map output before each size change and fuse coarse and fine features, so that both large faces and small faces can be detected. This essentially achieves the effect of the image pyramid technique with a shorter overall processing time; the specific structure is shown in fig. 4. As can be seen from fig. 4, there are two main kinds of preprocessing: 1 × 1 convolution and 2 × upsampling (upsampling by a factor of 2). The 1 × 1 convolution adjusts the number of image channels: the channel count changes through the pooling stages, and two feature maps must have the same number of channels before they can be fused. The 2 × upsampling enlarges the feature map: sizes shrink during pooling, and two feature maps must have the same size before they can be fused.
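The 2 × upsampling step can be sketched in a few lines. Nearest-neighbour interpolation is an assumption here: the text only says "2 × upsampling" without naming the interpolation method.

```python
# Minimal sketch of 2x upsampling of a 2D feature map (list of lists),
# nearest-neighbour style: every value becomes a 2x2 block.

def upsample2x(feature):
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                     # duplicate each row
    return out

small = [[1, 2],
         [3, 4]]
print(upsample2x(small))
# -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

After this step the coarse map has the same height and width as the finer map one pooling stage up, so the two can be fused element-wise.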
And S102, acquiring a face image training set.
And S103, training the face detection network by using the face image training set.
This embodiment trains the face detection network on the public face data set CASIA-WebFace, using 400,000 images as the face image training set and 10,000 images as the face image test set for subsequent testing.
Further, as shown in fig. 5, the step S103 specifically includes:
and S1031, dividing the face image training set into a plurality of batches of face images.
In this embodiment, the face image training set is divided into 12500 batches of face images, that is, 32 face images in each batch.
S1032, setting a plurality of periods, and sequentially inputting the face images of each batch into the face detection network for training in each period so as to finish the training of the face detection network in the plurality of periods.
In this embodiment, 400 periods (epochs) are set and the Adam optimizer is used. In each period, training proceeds batch by batch with 32 face images per batch; weight decay is set to 0.0001, the initial learning rate to 0.001, and the learning rate is reduced by 90% every 100 periods.
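The learning-rate schedule just described (initial rate 0.001, multiplied by 0.1 every 100 periods of the 400-period run) can be sketched as:

```python
# Step-decay learning-rate schedule: a 90% reduction means multiplying
# the current rate by 0.1 at each 100-period boundary.

def learning_rate(epoch, initial=0.001, drop=0.1, every=100):
    return initial * (drop ** (epoch // every))

print([learning_rate(e) for e in (0, 100, 200, 300)])
```

So the rate stays at 0.001 for periods 0-99, drops to 0.0001 for periods 100-199, and so on down to 1e-6 for the final hundred periods.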
Face detection is essentially a combination of image classification (here, into face and non-face) and regression-box (a rectangular box containing the whole face) localization. If image classification and regression-box localization were trained independently, the algorithm could not obtain good face detection capability. The loss function used by the invention is therefore a multi-task loss: image classification and regression-box localization are trained within the same loss function, allowing joint tuning of the whole network. The specific expression is given in formula (2):
L(p, u, t^u, v) = α·L_cls(p, u) + L_loc(t^u, v)    (2)

wherein L(p, u, t^u, v) represents the overall loss value of the face detection network; L_cls represents the classification loss value, L_cls(p, u) = −log(p_u); L_loc represents the regression-box position loss value,

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i^u − v_i), where smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise;

p = (p_0, …, p_k) represents the probability value the face detection network predicts for each category of the detected object, and k indicates that the classified objects are face and non-face (k is 2 in this embodiment); u represents the manually labelled category, which in this embodiment is a face; t^u = (t_x^u, t_y^u, t_w^u, t_h^u) represents the position at which the face detection network detected a face, where t_x^u and t_y^u are the x and y values of the upper-left corner of the face frame containing the detected object, and t_w^u and t_h^u are the width and height of the regression frame; v = (v_x, v_y, v_w, v_h) represents the position of the manually labelled face, where v_x and v_y are the x and y values of the upper-left corner of the manually labelled face frame, and v_w and v_h are its width and height; α is a weight, set to 2 in this embodiment.
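As a runnable sketch of this multi-task loss: the classification term and the α = 2 weight follow the description above, while the smooth-L1 form of the box-regression term is an assumption (the patent's rendered equations appear only as image placeholders in this text).

```python
import math

# Multi-task detection loss: alpha * classification loss + box position loss.

def smooth_l1(x):
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def detection_loss(p_u, t, v, alpha=2.0):
    """p_u: predicted probability of the true class (face);
    t, v: predicted / labelled boxes as (x, y, w, h) tuples."""
    l_cls = -math.log(p_u)                                   # classification loss
    l_loc = sum(smooth_l1(ti - vi) for ti, vi in zip(t, v))  # box position loss
    return alpha * l_cls + l_loc

# With a perfectly localized box, only the weighted classification term remains.
print(detection_loss(0.9, (10, 10, 50, 60), (10, 10, 50, 60)))
```

Training both terms in one objective is what lets the shared backbone be tuned jointly for classification and localization, as the paragraph above argues.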
And S104, carrying out face detection on the image to be detected by using the trained face detection network to obtain a face detection result.
Further, the step S104 specifically includes:
S1041, inputting the image to be detected into the trained face detection network, processing it in sequence through the first dense block and the second dense block, and outputting the first image feature.
In this embodiment, a face image from the face image test set is used as the image to be detected; it is input into the trained face detection network, processed in sequence by the first and second dense blocks, and the first image feature is output.
S1042, inputting the first image feature into the first pooling layer for down-sampling, processing the down-sampled result in sequence through the third, fourth and fifth dense blocks, and outputting the second image feature.
S1043, inputting the second image feature into the second pooling layer for down-sampling, processing the down-sampled result in sequence through the sixth, seventh and eighth dense blocks, and outputting the third image feature.
S1044, performing feature fusion on the second image feature and the third image feature using the first fusion module, and outputting the fourth image feature.
S1045, performing feature fusion on the first image feature and the fourth image feature using the second fusion module, and outputting the face detection result.
Steps S101 to S103 form the offline (training) stage, consisting of three main parts: constructing the face detection network, acquiring the face image training set, and training the network. Step S104 is the online (application) stage. It can be understood that steps S101 to S103 are completed on one computer device (e.g., a computer); the application stage of step S104 may run on the same device, or the face detection network trained on that device may be ported to another computer device (e.g., a small embedded device such as a mobile phone or tablet computer, or a device with limited memory), with the application stage of step S104 then performed on that other device.
CMS-RCNN (Contextual Multi-Scale Region-based Convolutional Neural Network), SSH (Single Stage Headless face detector), S3FD (Single Shot Scale-invariant Face Detector) and the face detection network of this embodiment were tested on the WIDER FACE face detection benchmark (the data set is divided into easy, medium and hard image sets according to face size, degree of profile, occlusion, etc.); the test accuracies are shown in Table 1 below.
TABLE 1 Test accuracy of each model
Model Easy image set Medium image set Hard image set
CMS-RCNN 89.9% 87.4% 62.9%
SSH 91.9% 90.7% 81.4%
S3FD 93.1% 92.1% 84.5%
Face detection network of the embodiment 93.8% 93.1% 86.4%
As can be seen from Table 1, the test accuracy of the face detection network of this embodiment is higher than that of the other three models on all three image sets, indicating that it can effectively detect faces.
Those skilled in the art will appreciate that all or part of the steps in the method for implementing the above embodiments may be implemented by a program to instruct associated hardware, and the corresponding program may be stored in a computer-readable storage medium.
It should be noted that although the method operations of the above-described embodiments are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may be executed in a different order; additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps.
Example 2:
as shown in fig. 7, the present embodiment provides a face detection system based on a feature pyramid and a dense block, the system includes a construction unit 701, an acquisition unit 702, a training unit 703 and a detection unit 704, and specific functions of each unit are as follows:
the constructing unit 701 is configured to construct a face detection network according to the dense block, the pooling layer, and the feature fusion module based on the feature pyramid.
The acquiring unit 702 is configured to acquire a face image training set.
The training unit 703 is configured to train a face detection network by using a face image training set.
The detecting unit 704 is configured to perform face detection on the image to be detected by using the trained face detection network, so as to obtain a face detection result.
Further, as shown in fig. 8, the training unit 703 specifically includes:
a dividing subunit 7031, configured to divide the face image training set into multiple batches of face images.
The training subunit 7032 is configured to set a plurality of periods and to sequentially input each batch of face images into the face detection network for training in each period, so as to complete the training of the face detection network over the plurality of periods.
The specific implementation of each unit in this embodiment may refer to Embodiment 1 and is not repeated here. It should be noted that the system provided in this embodiment is only illustrated by the above division of functional units; in practical applications, the above functions may be allocated to different functional units as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
Example 3:
the present embodiment provides a computer device, which may be a computer as shown in fig. 9, comprising a processor 902, a memory, an input device 903, a display 904 and a network interface 905 connected by a system bus 901. The processor 902 provides computing and control capabilities. The memory includes a nonvolatile storage medium 906 and an internal memory 907: the nonvolatile storage medium 906 stores an operating system, computer programs and a database, and the internal memory 907 provides an environment for running the operating system and the computer programs in the nonvolatile storage medium. When the processor 902 executes the computer programs stored in the memory, the face detection method of the above embodiment 1 is implemented as follows:
constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
acquiring a face image training set;
training a face detection network by using a face image training set;
and carrying out face detection on the image to be detected by using the trained face detection network to obtain a face detection result.
Further, the training of the face detection network by using the face image training set specifically includes:
dividing a face image training set into a plurality of batches of face images;
and setting a plurality of periods, and sequentially inputting the face images of each batch into the face detection network for training in each period so as to finish the training of the face detection network in the plurality of periods.
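The batch-and-period training scheme described above can be sketched as a generic loop. The `train_step` callback, batch size and period (epoch) count below are illustrative assumptions; the patent does not specify them.

```python
def split_into_batches(samples, batch_size):
    """Divide the face image training set into multiple batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def train(train_step, samples, batch_size, num_periods):
    """Set a number of periods; in each period, feed every batch of face
    images into the network in sequence via the train_step callback."""
    batches = split_into_batches(samples, batch_size)
    for _period in range(num_periods):
        for batch in batches:
            train_step(batch)
    return len(batches) * num_periods  # total number of training steps

# usage with a stub step that records the size of each batch it receives
calls = []
steps = train(lambda b: calls.append(len(b)), list(range(10)),
              batch_size=4, num_periods=3)
print(steps)  # 3 batches x 3 periods = 9 steps
```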
Furthermore, in the face detection network, eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module;
the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block and the eighth dense block are connected in sequence;
the input of the first feature fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block;
and the input of the second characteristic fusion module is respectively connected with the output of the second dense block and the output of the first characteristic fusion module.
Further, the performing face detection on the image to be detected by using the trained face detection network to obtain a face detection result specifically includes:
inputting an image to be detected into a trained face detection network, sequentially processing the image through a first dense block and a second dense block, and outputting to obtain a first image characteristic;
inputting the first image characteristic into a first pooling layer for down-sampling, sequentially processing the first image characteristic by a third dense block, a fourth dense block and a fifth dense block after the down-sampling, and outputting to obtain a second image characteristic;
inputting the second image characteristic into a second pooling layer for down-sampling, sequentially processing the second image characteristic by a sixth dense block, a seventh dense block and an eighth dense block after the down-sampling, and outputting to obtain a third image characteristic;
performing feature fusion on the second image feature and the third image feature by using a first fusion module, and outputting to obtain a fourth image feature;
and performing feature fusion on the first image feature and the fourth image feature by using a second fusion module, and outputting to obtain a face detection result.
Example 4:
the present embodiment provides a storage medium, which is a computer-readable storage medium, and stores a computer program, where when the computer program is executed by a processor, the computer program implements the face detection method of embodiment 1, as follows:
constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
acquiring a face image training set;
training a face detection network by using a face image training set;
and carrying out face detection on the image to be detected by using the trained face detection network to obtain a face detection result.
Further, the training of the face detection network by using the face image training set specifically includes:
dividing a face image training set into a plurality of batches of face images;
and setting a plurality of periods, and sequentially inputting the face images of each batch into the face detection network for training in each period so as to finish the training of the face detection network in the plurality of periods.
Furthermore, in the face detection network, eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module;
the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block and the eighth dense block are connected in sequence;
the input of the first feature fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block;
and the input of the second characteristic fusion module is respectively connected with the output of the second dense block and the output of the first characteristic fusion module.
Further, the performing face detection on the image to be detected by using the trained face detection network to obtain a face detection result specifically includes:
inputting an image to be detected into a trained face detection network, sequentially processing the image through a first dense block and a second dense block, and outputting to obtain a first image characteristic;
inputting the first image characteristic into a first pooling layer for down-sampling, sequentially processing the first image characteristic by a third dense block, a fourth dense block and a fifth dense block after the down-sampling, and outputting to obtain a second image characteristic;
inputting the second image characteristic into a second pooling layer for down-sampling, sequentially processing the second image characteristic by a sixth dense block, a seventh dense block and an eighth dense block after the down-sampling, and outputting to obtain a third image characteristic;
performing feature fusion on the second image feature and the third image feature by using a first fusion module, and outputting to obtain a fourth image feature;
and performing feature fusion on the first image feature and the fourth image feature by using a second fusion module, and outputting to obtain a face detection result.
The storage medium described in this embodiment may be a magnetic disk, an optical disk, a computer memory, a random access memory (RAM), a USB flash disk, a removable hard disk, or other media.
In conclusion, the invention has fewer parameters and more stable training. The dense block is used as the main module of the face detection network; by exploiting the dense block's advantage of feature reuse, high-quality feature extraction can be achieved with fewer network layers, and the vanishing-gradient phenomenon during network training is also alleviated. In addition, the face detection time is short: the feature pyramid technique replaces the image pyramid technique used in traditional face detection algorithms, so the algorithm achieves faster detection with unchanged detection quality and can meet the application requirements of scenarios with lower demands on very small faces.
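The speed advantage of the feature pyramid over the image pyramid can be shown with back-of-the-envelope arithmetic: an image pyramid runs the whole detector once per scale, so its cost grows with the sum of squared scale factors, while a feature pyramid extracts all scales in a single backbone pass. The scale set below is a common choice, not one specified by the patent.

```python
def image_pyramid_cost(scales):
    """Relative cost of running a detector once per pyramid scale;
    the cost of one pass is proportional to the pixel count (scale**2)."""
    return sum(s ** 2 for s in scales)

def feature_pyramid_cost():
    """A feature pyramid reuses a single backbone pass for every scale."""
    return 1.0

scales = [1.0, 0.5, 0.25]  # hypothetical image-pyramid scales
ratio = image_pyramid_cost(scales) / feature_pyramid_cost()
print(ratio)  # 1.3125 -- the image pyramid does ~31% more backbone work
```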
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any substitution or change made by a person skilled in the art within the technical solution and inventive concept of the present invention falls within the protection scope of the present invention.

Claims (7)

1. A face detection method based on a feature pyramid and a dense block is characterized by comprising the following steps:
constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
acquiring a face image training set;
training a face detection network by using a face image training set;
carrying out face detection on an image to be detected by using the trained face detection network to obtain a face detection result;
in the face detection network, eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module;
the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block and the eighth dense block are connected in sequence;
the input of the first feature fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block;
the input of the second feature fusion module is respectively connected with the output of the second dense block and the output of the first feature fusion module;
the method for detecting the face of the image to be detected by using the trained face detection network to obtain the face detection result specifically comprises the following steps:
inputting an image to be detected into a trained face detection network, sequentially processing the image through a first dense block and a second dense block, and outputting to obtain a first image characteristic;
inputting the first image characteristic into a first pooling layer for down-sampling, sequentially processing the first image characteristic by a third dense block, a fourth dense block and a fifth dense block after the down-sampling, and outputting to obtain a second image characteristic;
inputting the second image characteristic into a second pooling layer for down-sampling, sequentially processing the second image characteristic by a sixth dense block, a seventh dense block and an eighth dense block after the down-sampling, and outputting to obtain a third image characteristic;
performing feature fusion on the second image feature and the third image feature by using a first fusion module, and outputting to obtain a fourth image feature;
performing feature fusion on the first image feature and the fourth image feature by using a second fusion module, and outputting to obtain a face detection result;
the training loss function of the face detection network is as follows:

L(p, u, t^u, v) = α·L_cls(p, u) + L_loc(t^u, v)

wherein L(p, u, t^u, v) represents the overall loss value of the face detection network; L_cls represents the classification loss value, L_cls(p, u) = −log(p_u); L_loc represents the regression-frame position loss value,

L_loc(t^u, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i^u − v_i), where smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise;

p = (p_0, …, p_k) represents the probability value predicted by the face detection network for each category of the detected object, k indicating that the classified objects are face and non-face; u represents the face category; t^u = (t_x^u, t_y^u, t_w^u, t_h^u) represents the position information of the face detected by the face detection network, where t_x^u, t_y^u, t_w^u, t_h^u respectively represent the x and y values of the upper-left corner of the detected face frame and the width and height of the regression frame; v = (v_x, v_y, v_w, v_h) represents the position information of the manually annotated face, where v_x, v_y, v_w, v_h respectively represent the x and y values of the upper-left corner of the manually annotated face frame and the width and height of the regression frame; α is a weight.
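The loss above can be written out numerically as a hedged sketch, assuming the conventional Fast R-CNN style definition of smooth_L1; the α value and the box coordinates in the usage line are illustrative, not from the patent.

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5*x**2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def detection_loss(p_u, t_u, v, alpha=1.0):
    """Overall loss L = alpha * L_cls + L_loc.

    p_u: predicted probability of the true (face) class,
    t_u: predicted box (x, y, w, h), v: annotated box (x, y, w, h).
    """
    l_cls = -np.log(p_u)                          # classification loss
    l_loc = smooth_l1(np.subtract(t_u, v)).sum()  # regression-frame loss
    return alpha * l_cls + l_loc

# hypothetical prediction vs annotation
loss = detection_loss(0.9, (10.0, 10.0, 50.0, 50.0), (10.5, 10.0, 52.0, 50.0))
```

The coordinate differences here are (−0.5, 0, −2, 0), so the two branches of smooth_L1 (quadratic near zero, linear beyond |x| = 1) are both exercised.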
2. The method of claim 1, wherein the training of the face detection network using the face image training set specifically comprises:
dividing a face image training set into a plurality of batches of face images;
and setting a plurality of periods, and sequentially inputting the face images of each batch into the face detection network for training in each period so as to finish the training of the face detection network in the plurality of periods.
3. The face detection method according to any one of claims 1-2, wherein the dense block comprises a plurality of elements of the same type, the plurality of elements are connected in sequence, and the plurality of elements are connected into a whole through jump connection;
each element comprises a 1×1 convolutional layer, a first Swish function layer, a first batch normalization layer, a 3×3 convolutional layer, a second Swish function layer and a second batch normalization layer which are sequentially connected, wherein the 1×1 convolutional layer is used for compressing the output data of the previous layer and reducing its dimensionality.
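The feature-reuse advantage of the jump connections can be illustrated with simple channel bookkeeping: each element sees the block input concatenated with the outputs of all earlier elements, so its input width grows linearly, which is what the 1×1 compression convolution then counteracts. The growth rate and element count below are hypothetical; the patent does not specify them.

```python
def dense_block_channels(in_channels, growth_rate, num_elements):
    """Input channel count seen by each element of a dense block.

    Element i receives the block input concatenated (via jump connections)
    with the outputs of all previous elements, each of width growth_rate;
    the element's 1x1 convolution compresses this widened input.
    """
    widths = []
    c = in_channels
    for _ in range(num_elements):
        widths.append(c)   # channels entering this element's 1x1 conv
        c += growth_rate   # its 3x3 conv output is concatenated on
    return widths, c       # per-element input widths, block output width

widths, out_channels = dense_block_channels(in_channels=64, growth_rate=32,
                                            num_elements=4)
print(widths, out_channels)  # [64, 96, 128, 160] 192
```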
4. The method of any of claims 1-2, wherein the feature pyramid is preprocessed before feature fusion, the preprocessing including 1x1 convolution and 2x upsampling.
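A minimal NumPy sketch of the two preprocessing operations named in this claim: a 1×1 convolution (a per-pixel linear projection over channels) followed by 2× nearest-neighbour up-sampling. The channel counts and the nearest-neighbour choice are illustrative assumptions; the patent does not state the up-sampling interpolation mode.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def upsample2x(x):
    """2x nearest-neighbour up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 10, 10))  # hypothetical coarse pyramid level
w = rng.standard_normal((128, 256))        # hypothetical 1x1 conv weights
out = upsample2x(conv1x1(feat, w))
print(out.shape)  # (128, 20, 20): channels reduced, spatial size doubled
```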
5. A system for detecting a face based on a feature pyramid and a dense block, the system comprising:
the construction unit is used for constructing a face detection network according to the dense block, the pooling layer and the feature fusion module based on the feature pyramid;
the acquisition unit is used for acquiring a face image training set;
the training unit is used for training the face detection network by using a face image training set;
the detection unit is used for carrying out face detection on the image to be detected by utilizing the trained face detection network to obtain a face detection result;
in the face detection network, eight dense blocks are provided, two pooling layers are provided, two feature fusion modules are provided, the eight dense blocks are respectively a first dense block, a second dense block, a third dense block, a fourth dense block, a fifth dense block, a sixth dense block, a seventh dense block and an eighth dense block, the two pooling layers are respectively a first pooling layer and a second pooling layer, and the feature fusion modules are respectively a first feature fusion module and a second feature fusion module;
the first dense block, the second dense block, the first pooling layer, the third dense block, the fourth dense block, the fifth dense block, the second pooling layer, the sixth dense block, the seventh dense block and the eighth dense block are connected in sequence;
the input of the first feature fusion module is respectively connected with the output of the fifth dense block and the output of the eighth dense block;
the input of the second feature fusion module is respectively connected with the output of the second dense block and the output of the first feature fusion module;
the method for detecting the face of the image to be detected by using the trained face detection network to obtain the face detection result specifically comprises the following steps:
inputting an image to be detected into a trained face detection network, sequentially processing the image through a first dense block and a second dense block, and outputting to obtain a first image characteristic;
inputting the first image characteristic into a first pooling layer for down-sampling, sequentially processing the first image characteristic by a third dense block, a fourth dense block and a fifth dense block after the down-sampling, and outputting to obtain a second image characteristic;
inputting the second image characteristic into a second pooling layer for down-sampling, sequentially processing the second image characteristic by a sixth dense block, a seventh dense block and an eighth dense block after the down-sampling, and outputting to obtain a third image characteristic;
performing feature fusion on the second image feature and the third image feature by using a first fusion module, and outputting to obtain a fourth image feature;
performing feature fusion on the first image feature and the fourth image feature by using a second fusion module, and outputting to obtain a face detection result;
the training loss function of the face detection network is as follows:

L(p, u, t^u, v) = α·L_cls(p, u) + L_loc(t^u, v)

wherein L(p, u, t^u, v) represents the overall loss value of the face detection network; L_cls represents the classification loss value, L_cls(p, u) = −log(p_u); L_loc represents the regression-frame position loss value,

L_loc(t^u, v) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i^u − v_i), where smooth_L1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise;

p = (p_0, …, p_k) represents the probability value predicted by the face detection network for each category of the detected object, k indicating that the classified objects are face and non-face; u represents the face category; t^u = (t_x^u, t_y^u, t_w^u, t_h^u) represents the position information of the face detected by the face detection network, where t_x^u, t_y^u, t_w^u, t_h^u respectively represent the x and y values of the upper-left corner of the detected face frame and the width and height of the regression frame; v = (v_x, v_y, v_w, v_h) represents the position information of the manually annotated face, where v_x, v_y, v_w, v_h respectively represent the x and y values of the upper-left corner of the manually annotated face frame and the width and height of the regression frame; α is a weight.
6. A computer device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the face detection method of any one of claims 1-4.
7. A storage medium storing a program which, when executed by a processor, implements the face detection method of any one of claims 1 to 4.
CN202010134064.6A 2020-03-02 2020-03-02 Face detection method, system, device and medium based on feature pyramid and dense block Active CN111368707B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010134064.6A CN111368707B (en) 2020-03-02 2020-03-02 Face detection method, system, device and medium based on feature pyramid and dense block


Publications (2)

Publication Number Publication Date
CN111368707A CN111368707A (en) 2020-07-03
CN111368707B true CN111368707B (en) 2023-04-07

Family

ID=71210269


Country Status (1)

Country Link
CN (1) CN111368707B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783749A (en) * 2020-08-12 2020-10-16 成都佳华物链云科技有限公司 Face detection method and device, electronic equipment and storage medium
CN113674155A (en) * 2021-08-25 2021-11-19 中国铁塔股份有限公司湖北省分公司 Image super-resolution method, device and storage medium based on information aggregation network
CN113763300B (en) * 2021-09-08 2023-06-06 湖北工业大学 Multi-focusing image fusion method combining depth context and convolution conditional random field

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN107871101A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108647668A (en) * 2018-05-21 2018-10-12 北京亮亮视野科技有限公司 The construction method of multiple dimensioned lightweight Face datection model and the method for detecting human face based on the model
CN108875624A (en) * 2018-06-13 2018-11-23 华南理工大学 Method for detecting human face based on the multiple dimensioned dense Connection Neural Network of cascade




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant